Behemoth-R1-123B-v2 — Quantized (compressed-tensors for vLLM)

Revisions & Branches

  • main — placeholder landing branch. The canonical README lives here; model files may be minimal.
  • NVFP4 — 4-bit weights / 4-bit activations (in practice close to 16-bit-activation quality)
  • W4A16 — Symmetric AWQ 4-bit weights / 16-bit activations; builds and related assets are published under this revision.
  • W8A16 — Symmetric AWQ 8-bit weights / 16-bit activations; builds and related assets are published under this revision.
  • W8A8-FP8_BLOCK — FP8 8-bit weights / 8-bit activations with block-wise scaling, enabling the CUTLASS kernels on Blackwell (SM 12.0); requires a recent vLLM.

🔗 Quick links:
Browse main · Browse NVFP4 · Browse W4A16 · Browse W8A16 · Browse W8A8-FP8_BLOCK
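
Each branch above can also be fetched directly by revision. A minimal Python sketch (assumes the huggingface_hub package; the branch name shown is one of those listed above):

from huggingface_hub import snapshot_download

# Download a single quantization branch of this repo by passing its name as `revision`.
local_dir = snapshot_download(
    repo_id="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",  # or: NVFP4, W8A16, W8A8-FP8_BLOCK
)
print(local_dir)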

This repository hosts multiple quantizations of
TheDrummer/Behemoth-R1-123B-v2 (a finetune of mistralai/Mistral-Large-Instruct-2411), packaged for vLLM in the compressed-tensors runtime format.

TL;DR

  • This repo is quantized for vLLM (see Revisions & Branches above; e.g., NVFP4, AWQ W4A16, AWQ W8A16, and FP8 W8A8 block).
  • Load with vLLM using --quantization compressed-tensors and select the branch with your desired quant (see the Python sketch below this list).
  • Typical AWQ recipe: group_size=128, keep lm_head in higher precision; uses the upstream Mistral‑Instruct chat template.
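
As noted above, a branch can also be selected for offline (non-server) use. A minimal Python sketch, assuming vLLM's offline LLM API (LLM.chat needs a recent vLLM); branch name and parallelism are illustrative:

from vllm import LLM, SamplingParams

# Load one quantization branch of this repo; the branch is picked via `revision`.
llm = LLM(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",                    # branch carrying the desired quant
    quantization="compressed-tensors",
    tensor_parallel_size=8,              # adjust to your hardware
    max_model_len=32768,
)

messages = [{"role": "user", "content": "Outline a retrieval pipeline for legal documents."}]
outputs = llm.chat(messages, SamplingParams(max_tokens=512, temperature=0.7, top_p=0.95))
print(outputs[0].outputs[0].text)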

Repository Contents

  • Quantized weights in sharded .safetensors (model-00001-of-XXXXX.safetensors + model.safetensors.index.json)
  • config.json with compressed-tensors metadata
  • Tokenizer artifacts (e.g., tokenizer.json, tokenizer.model)
  • (If present) chat_template.jinja
  • This README.md

Exact file list may vary by release; see Files and versions.


Lineage

  • Base: mistralai/Mistral-Large-Instruct-2411
  • Finetune: TheDrummer/Behemoth-R1-123B-v2
  • This repo: TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors (compressed-tensors quantizations of the finetune)

Quickstart — vLLM (compressed-tensors)

Install vLLM (use a recent version):

pip install vllm

Serve the quantized model (adjust parallelism to your hardware):

# Example: tensor parallel across 8 GPUs
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 vllm serve TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16   # or float16 on GPUs without strong BF16

Query via Chat Completions:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are Behemoth, helpful, precise, and safe."},
      {"role":"user","content":"Outline a retrieval pipeline for legal documents."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
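
The same request can be issued from Python with any OpenAI-compatible client. A minimal sketch (assumes the openai package and the server started above):

from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; the API key is ignored by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
        {"role": "user", "content": "Outline a retrieval pipeline for legal documents."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)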

Note: compressed-tensors is a vLLM runtime format. Loading this artifact directly in vanilla 🤗 Transformers is not supported; use vLLM for inference. If you need Transformers inference, use a different export (e.g., GPTQ/AWQ .safetensors compatible with Transformers) or full‑precision weights.


Prompting / Chat Template

This package inherits the Mistral‑Instruct chat conventions from its parent finetune. If a chat_template.jinja is present, it is applied automatically by apply_chat_template within serving stacks that support it.
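
To inspect the rendered prompt locally, a small Python sketch using 🤗 Transformers' tokenizer utilities (tokenizer files only; this does not load the quantized weights into Transformers):

from transformers import AutoTokenizer

# Render the chat template without loading any model weights.
tok = AutoTokenizer.from_pretrained("TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors")
messages = [
    {"role": "system", "content": "You are Behemoth, helpful, precise, and safe."},
    {"role": "user", "content": "Summarize the filing in three bullet points."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the exact Mistral-Instruct-style prompt the serving stack will build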

Tips

  • Provide a concise system role.
  • Structure multi‑step user prompts explicitly.
  • For tool use, include clear schemas and results to minimize hallucinations.

Recommended Generation Settings

Starting points (tune for your latency/quality targets); the sketch after this list expresses them as vLLM sampling parameters:

  • temperature: 0.4–0.9 (0.6–0.8 common)
  • top_p: 0.9–0.95
  • max_new_tokens: 256–2048+
  • Optional repetition_penalty: 1.05–1.15
  • Enable vLLM batching/scheduling features for throughput.
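
As referenced above, the same starting points expressed as vLLM sampling parameters (a sketch; values picked from the middle of the ranges):

from vllm import SamplingParams

# Mid-range starting values from the list above; tune for your workload.
params = SamplingParams(
    temperature=0.7,          # 0.4–0.9
    top_p=0.95,               # 0.9–0.95
    max_tokens=1024,          # maps to max_new_tokens: 256–2048+
    repetition_penalty=1.1,   # optional: 1.05–1.15
)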

Hardware Guidance

  • 123B is large; multi‑GPU with tensor parallelism is recommended.
  • Quantization reduces weights memory; the KV cache (activations) still dominates at long context. Adjust --max-model-len and batch size accordingly (see the sketch after this list).
  • Use BF16 where supported; otherwise FP16.
  • CUDA Graphs can help if stable in your stack.
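
A rough sketch of those memory knobs in offline mode (argument names mirror the CLI flags; values are placeholders, and the FP8 KV cache is optional and hardware/version dependent):

from vllm import LLM

# Memory-related knobs: a shorter context and an FP8 KV cache shrink activation memory.
llm = LLM(
    model="TheHouseOfTheDude/Behemoth-R1-123B-v2_Compressed-Tensors",
    revision="W4A16",
    quantization="compressed-tensors",
    tensor_parallel_size=8,
    max_model_len=16384,           # shorter context => smaller KV cache
    gpu_memory_utilization=0.85,   # fraction of each GPU vLLM may reserve
    kv_cache_dtype="fp8",          # optional; requires support in your stack
    dtype="bfloat16",              # or "float16" where BF16 support is weak
)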

Evaluation & Safety

  • No official benchmark set is included; evaluate on your tasks before production.
  • Apply content safety, guardrails, and human review for high‑stakes use cases.

License & Usage

This distribution inherits licenses/restrictions of:

  • mistralai/Mistral-Large-Instruct-2411 (base)
  • TheDrummer/Behemoth-R1-123B-v2 (finetune)

Using this model implies acceptance of the upstream terms.


Changelog

  • v2 (current) — Quantized releases (AWQ W4A16_ASYM and INT8 W8A16) under TheHouseOfTheDude.
