L3.3-70B-Animus-V12.0 — Quantized (compressed-tensors for vLLM)
This repository hosts quantized runtime builds of
Darkhn/Magistral-2509-24B-Animus-V12.0, repackaged for vLLM using the compressed-tensors format.
TL;DR
• Quantized package intended for vLLM with `--quantization compressed-tensors`.
• Two published branches: W4A16 (INT4/A16) and W8A16 (INT8/A16).
• Weight-only quantization; see “Quantization recipe” below.
Revisions & Branches
The `main` branch is a landing page (model card + links). All runnable artifacts live under per-revision branches.
- main — placeholder / landing page
- W4A16 — 4-bit weights / 16-bit activations builds and runtime assets
- W8A16 — 8-bit weights / 16-bit activations builds
Quick links
- main: https://huggingface.co/TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors/tree/W4A16
- W8A16: https://huggingface.co/TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors/tree/W8A16
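To pull a specific build locally, fetch the branch by revision with `huggingface_hub`. A minimal sketch (the revision names match the branches above; the local path is just your cache location):

```python
from huggingface_hub import snapshot_download

# Download only the chosen branch; use revision="W8A16" for the INT8 build.
local_dir = snapshot_download(
    repo_id="TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors",
    revision="W4A16",
)
print(local_dir)  # path to the cached snapshot
```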
What’s inside (per revision)
- Sharded quantized weights (*.safetensors) + index (model.safetensors.index.json)
- `config.json` with compressed-tensors metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab if applicable)
- Optional: `chat_template.jinja` (inherits the parent finetune’s chat style)
Exact files vary by branch; see the Files and versions tab.
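Since the quantization metadata lives in `config.json`, you can check what a given branch ships without downloading the weight shards. A small sketch using `huggingface_hub` (the `quantization_config` key follows the compressed-tensors convention noted above):

```python
import json
from huggingface_hub import hf_hub_download

# Grab only config.json from the W4A16 branch and print its quantization metadata.
cfg_path = hf_hub_download(
    repo_id="TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors",
    filename="config.json",
    revision="W4A16",
)
with open(cfg_path) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))
```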
Quantization recipe (reference)
Shared choices
- Format: compressed-tensors (vLLM runtime), weight-only quantization
- Group size: 128 (typical for W4A16)
- Ignored layers: `lm_head` kept in higher precision
- Calibration: prose-heavy text (WikiText-style), seq_len ≈ 512–1024, tokenized with the parent chat template
- Export: `save_compressed=True` to embed compressed-tensors metadata for vLLM
Branch specifics
- W4A16 — INT4 weights / 16-bit activations (BF16/FP16 at runtime)
- W8A16 — INT8 weights / 16-bit activations (higher memory, extra stability on some workloads)
Refer to each branch’s `config.json` for the exact parameters used; a hedged sketch of a comparable quantization script follows.
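The exact export pipeline isn’t published here; one common way to produce compressed-tensors weight-only checkpoints with these settings is a GPTQ pass via `llmcompressor` (recent release assumed). The sketch below is illustrative only for the W4A16 branch — the calibration dataset, sample count, and output path are assumptions, not the recipe actually used:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "Darkhn/Magistral-2509-24B-Animus-V12.0"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Weight-only INT4 (the W4A16 scheme defaults to group size 128); lm_head stays in higher precision.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",      # stand-in for the prose-heavy calibration text described above
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=512,  # assumption; not documented in this card
)

# save_compressed=True embeds the compressed-tensors metadata vLLM expects.
model.save_pretrained("Animus-V12.0-W4A16", save_compressed=True)
tokenizer.save_pretrained("Animus-V12.0-W4A16")
```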
Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):
```bash
pip install vllm
```
Serve (adjust to your hardware):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 \
vllm serve TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors \
  --quantization compressed-tensors \
  --revision W4A16 \
  --tensor-parallel-size 4
```
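Once the server is up, it exposes vLLM’s OpenAI-compatible API (default port 8000). A minimal request against the defaults above:

```python
import requests

# Chat completion against the locally served model (assumes default host/port).
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 128,
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```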
Model tree for TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors
- Base model: mistralai/Mistral-Small-3.1-24B-Base-2503
- Finetuned: mistralai/Magistral-Small-2509
- Finetuned: Darkhn/Magistral-2509-24B-Animus-V12.0
- Quantized (this repository): TheHouseOfTheDude/L3.3-70B-Animus-V12.0_Compressed-Tensors