Behemoth-R1-123B-v2 — Quantized (compressed-tensors for vLLM)
Revisions & Branches
- main — placeholder landing branch. The canonical README lives here; model files may be minimal.
- NVFP4 — 4‑bit weights / 4‑bit activations (in practice close to 16‑bit activation quality); builds and related assets are published under this revision.
- W4A16 — Symmetric AWQ 4‑bit weights / 16‑bit activations; builds and related assets are published under this revision.
- W8A16 — Symmetric AWQ 8‑bit weights / 16‑bit activations; builds and related assets are published under this revision.
- W8A8-FP8_BLOCK — 8‑bit weights / 8‑bit activations (FP8 with block‑wise scaling), enabling the CUTLASS block‑scaled kernels on Blackwell (SM 12.0); requires a recent vLLM.
🔗 Quick links: Browse main · Browse NVFP4 · Browse W4A16 · Browse W8A16 · Browse W8A8-FP8_BLOCK

This repository hosts multiple quantizations of the finetuned parent model for vLLM using the compressed-tensors runtime format.
This repository provides quantized builds of TheDrummer/Behemoth-R1-123B-v2 (a finetune of mistralai/Mistral-Large-Instruct-2411), packaged for vLLM in the compressed-tensors format.
TL;DR
- This repo is quantized in several schemes for vLLM; see Revisions & Branches above for the currently published quants.
- Load with vLLM using `--quantization compressed-tensors` and select the branch with your desired quant (an offline Python sketch follows this list).
- Typical AWQ recipe: group_size=128, keep `lm_head` in higher precision; uses the upstream Mistral‑Instruct chat template.
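For offline (non-server) use, the sketch below shows how a quantization branch can be selected. It is a minimal example, assuming a recent vLLM in which `LLM.chat` and the `revision`/`quantization` engine arguments are available; the branch name, parallelism, and context length are illustrative and should be adjusted to your hardware.

```python
# Minimal offline-inference sketch (assumes a recent vLLM).
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",                   # pick the branch with your desired quant
    quantization="compressed-tensors",
    tensor_parallel_size=8,             # adjust to your GPU count
    max_model_len=32768,
    gpu_memory_utilization=0.70,
)

params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=512)
messages = [
    {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
    {"role": "user", "content": "Outline a retrieval pipeline for legal documents."},
]

# LLM.chat applies the repository's chat template before generating.
outputs = llm.chat(messages, params)
print(outputs[0].outputs[0].text)
```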
Repository Contents
- Quantized weights in sharded `.safetensors` (`model-00001-of-XXXXX.safetensors` + `model.safetensors.index.json`)
- `config.json` with compressed-tensors metadata
- Tokenizer artifacts (e.g., `tokenizer.json`, `tokenizer.model`)
- (If present) `chat_template.jinja`
- This `README.md`
Exact file list may vary by release; see Files and versions.
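To check what a particular branch actually ships, the files can be listed programmatically; a small sketch using `huggingface_hub` (the branch name passed as `revision` is only an example):

```python
# List the files published under a specific quantization branch.
from huggingface_hub import list_repo_files

files = list_repo_files(
    "TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",  # example branch; see "Revisions & Branches" above
)
print("\n".join(sorted(files)))
```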
Lineage
- Base model: mistralai/Mistral-Large-Instruct-2411
- Finetuned parent: TheDrummer/Behemoth-R1-123B-v2
- This repo: Quantized child of the finetune (compressed-tensors for vLLM)
Quickstart — vLLM (compressed-tensors)
Install vLLM (use a recent version):
```bash
pip install vllm
```
Serve the quantized model (adjust parallelism to your hardware):
```bash
# Example: tensor parallel across 8 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16   # or float16 on GPUs without strong BF16 support
```
Query via Chat Completions:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
      {"role": "user", "content": "Outline a retrieval pipeline for legal documents."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
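The same request can be made from Python with the OpenAI client pointed at the local vLLM server; a brief sketch (the API key is a placeholder, since vLLM does not require one by default):

```python
# Query the vLLM OpenAI-compatible server from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder key

response = client.chat.completions.create(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
        {"role": "user", "content": "Outline a retrieval pipeline for legal documents."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
```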
Note: `compressed-tensors` is a vLLM runtime format. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ `.safetensors` compatible with Transformers) or full‑precision weights.
Prompting / Chat Template
This package inherits the Mistral‑Instruct chat conventions from its parent finetune. If a `chat_template.jinja` is present, it is applied automatically by `apply_chat_template` within serving stacks that support it.
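To inspect exactly what the model will see, the template can be rendered without running inference using 🤗 Transformers' tokenizer utilities; a minimal sketch, assuming the tokenizer artifacts in this repo load as usual even though the quantized weights do not:

```python
# Render the chat template locally without running the model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors"
)
messages = [
    {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
    {"role": "user", "content": "Summarize the key terms of this contract."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the formatted string the serving stack feeds to the model
```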
Tips
- Provide a concise system role.
- Structure multi‑step user prompts explicitly.
- For tool use, include clear schemas and results to minimize hallucinations.
Recommended Generation Settings
Starting points (tune for your latency/quality targets):
- `temperature`: 0.4–0.9 (0.6–0.8 common)
- `top_p`: 0.9–0.95
- `max_new_tokens`: 256–2048+
- Optional `repetition_penalty`: 1.05–1.15 (see the request example after this list)
- Enable vLLM batching/scheduling features for throughput.
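Against the OpenAI-compatible server, `temperature`, `top_p`, and the token limit map directly to standard request fields, while `repetition_penalty` is a vLLM extension typically passed through `extra_body`; the sketch below encodes the starting points above (treat the extension parameter as an assumption to verify against your vLLM version):

```python
# Example request using the recommended starting points.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    messages=[{"role": "user", "content": "Draft a short project status update."}],
    temperature=0.7,    # 0.4–0.9 range, 0.6–0.8 common
    top_p=0.95,         # 0.9–0.95
    max_tokens=1024,    # 256–2048+ depending on the task
    extra_body={"repetition_penalty": 1.1},  # optional vLLM extension, 1.05–1.15
)
print(response.choices[0].message.content)
```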
Hardware Guidance
- 123B is large; multi‑GPU with tensor parallelism is recommended.
- Quantization reduces weight memory; the KV cache (activations) still dominates at long context, so adjust `--max-model-len` and batch size accordingly (a rough sizing sketch follows this list).
- Use BF16 where supported; otherwise FP16.
- CUDA Graphs can help if stable in your stack.
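For a rough sense of why the KV cache dominates, the sketch below estimates its per-sequence size from the attention geometry. The layer/head/dimension values are assumptions based on the usual Mistral-Large-2411 configuration; verify them against this repo's config.json before relying on the numbers.

```python
# Back-of-the-envelope KV-cache size per sequence (FP16/BF16 cache).
# NOTE: the model-geometry values below are assumptions; check config.json.
num_layers   = 88      # assumed num_hidden_layers
num_kv_heads = 8       # assumed num_key_value_heads (GQA)
head_dim     = 128     # assumed per-head dimension
bytes_per_el = 2       # FP16/BF16 cache element size
context_len  = 32768   # matches --max-model-len in the example above

per_token_bytes = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el  # K and V
total_gib = per_token_bytes * context_len / 1024**3
print(f"{per_token_bytes / 1024:.0f} KiB per token, ~{total_gib:.1f} GiB per 32k-token sequence")
```

Multiply by the number of concurrent sequences to see why `--gpu-memory-utilization` and `--max-model-len` need headroom beyond the quantized weights.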
Evaluation & Safety
- No official benchmark set is included; evaluate on your tasks before production.
- Apply content safety, guardrails, and human review for high‑stakes use cases.
License & Usage
This distribution inherits licenses/restrictions of:
- mistralai/Mistral-Large-Instruct-2411 (base)
- TheDrummer/Behemoth-R1-123B-v2 (finetune)
Using this model implies acceptance of the upstream terms.
Changelog
- v2 (current) — Quantized releases (AWQ W4A16_ASYM and INT8 W8A16) under TheHouseOfTheDude.
Links
- Base: mistralai/Mistral-Large-Instruct-2411
- Finetune parent: TheDrummer/Behemoth-R1-123B-v2
- This repo: TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors