KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)

This is a GPTQ quantized version of Kwaipilot/KAT-Dev-72B-Exp.

Quantization Details

  • Method: GPTQ (accurate post-training weight quantization)
  • Bits: 4
  • Group Size: 32
  • Quantization Type: INT
  • Symmetric: True
  • Calibration Samples: 128
  • Calibration Dataset: allenai/c4
  • Max Sequence Length: 512
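
The parameters above map onto an llm-compressor one-shot run. The sketch below is a guess at how such a checkpoint could be produced, not the author's actual script; note in particular that the stock W4A16 preset uses group_size=128, so reproducing group_size=32 would require a custom quantization config.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Hypothetical recipe: 4-bit symmetric, weight-only GPTQ on all Linear layers
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",       # 4-bit weights, 16-bit activations (preset group_size=128)
    ignore=["lm_head"],   # common practice: leave the output head unquantized
)

oneshot(
    model="Kwaipilot/KAT-Dev-72B-Exp",
    dataset="c4",                  # allenai/c4 calibration data
    recipe=recipe,
    max_seq_length=512,            # matches "Max Sequence Length" above
    num_calibration_samples=128,   # matches "Calibration Samples" above
    output_dir="KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
)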

Hardware Used for Quantization

  • 6x NVIDIA GeForce RTX 5090 (32GB each)
  • CUDA 12.8+
  • Sequential layer-by-layer processing (OOM-safe)
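
"Sequential layer-by-layer processing" means only one decoder block sits on a GPU at a time, which is how a 72B model fits through calibration on 32GB cards. A minimal sketch of the idea (the quantize_block callback and the model.model.layers layout are assumptions; llm-compressor handles this internally):

import torch

def quantize_sequentially(model, quantize_block, device="cuda"):
    # Onload one transformer block, quantize it, then offload it again, so
    # peak GPU memory stays near the size of one block rather than the model.
    for block in model.model.layers:   # assumes a Llama/Qwen-style layout
        block.to(device)
        quantize_block(block)          # run GPTQ on this block's Linear layers
        block.to("cpu")
        torch.cuda.empty_cache()       # release the freed GPU memory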

Usage

With vLLM

from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
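
If no single GPU can hold the roughly 40GB of INT4 weights, vLLM can shard the model across cards with tensor parallelism. The GPU count below is an assumption about your setup, not a requirement of this checkpoint:

# Hypothetical multi-GPU setup: shard the weights across 2 GPUs
llm = LLM(
    model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True,
    tensor_parallel_size=2,
)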

With Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
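
Note: llm-compressor (see Model Details below) saves checkpoints in the compressed-tensors format, so loading with Transformers likely requires the compressed-tensors package (pip install compressed-tensors); a checkpoint in the AutoGPTQ layout would instead need optimum plus a GPTQ kernel backend.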

Inference Performance

This quantized model offers:

  • ~4x memory reduction compared to FP16 (a rough estimate follows this list)
  • Faster inference where memory bandwidth is the bottleneck
  • Accuracy close to the FP16 baseline, since GPTQ calibrates against real data and compensates for quantization error layer by layer
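
The ~4x figure can be sanity-checked with back-of-the-envelope arithmetic, assuming 72B parameters and one 16-bit scale per 32-weight group (per the symmetric, group_size=32 config above):

# Weights-only estimate; excludes KV cache and activation memory
params = 72e9
fp16_gb = params * 2 / 1e9                 # 2 bytes/param       -> ~144 GB
int4_gb = params * (0.5 + 2 / 32) / 1e9    # 4-bit + fp16 scales -> ~40 GB
print(f"FP16: {fp16_gb:.0f} GB, INT4 gs32: {int4_gb:.0f} GB "
      f"(~{fp16_gb / int4_gb:.1f}x smaller)")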

Recommended Hardware

  • NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer)
  • Roughly 40GB of VRAM for the INT4 weights alone, so a single 48GB GPU or a multi-GPU setup
  • Additional VRAM (or more GPUs) for longer contexts and larger batch sizes, since the KV cache grows with both

Model Details

  • Base Model: Kwaipilot/KAT-Dev-72B-Exp
  • Quantization Tool: llm-compressor
  • Compatible Inference Engines: vLLM, TGI (Text Generation Inference)

Limitations

  • Quantization may affect model accuracy on certain tasks
  • Requires vLLM or compatible inference engine for optimal performance
