KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)

This is a GPTQ quantized version of Kwaipilot/KAT-Dev-72B-Exp.

Quantization Details

Method: GPTQ (post-training weight quantization)
Bits: 4
Group Size: 32
Quantization Type: INT
Symmetric: True
Calibration Samples: 128
Calibration Dataset: allenai/c4
Max Sequence Length: 512
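
To illustrate what "4 bits, group size 32, symmetric" means for the stored weights, here is a minimal sketch of group-wise symmetric INT4 quantization. It uses plain round-to-nearest rather than GPTQ itself (GPTQ additionally adjusts the remaining weights using calibration data to compensate for rounding error), and the function names are hypothetical, not part of llm-compressor.

```python
from typing import List, Tuple

BITS = 4
GROUP_SIZE = 32
QMAX = 2 ** (BITS - 1) - 1   # 7 for signed 4-bit integers
QMIN = -(2 ** (BITS - 1))    # -8 for signed 4-bit integers


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization of one group: a single scale, no zero point."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    # Round to the nearest integer code and clamp to the signed 4-bit range.
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    codes, scales = [], []
    for start in range(0, len(row), group_size):
        group_codes, scale = quantize_group(row[start:start + group_size])
        codes.extend(group_codes)
        scales.append(scale)
    return codes, scales


def dequantize_row(codes: List[int], scales: List[float],
                   group_size: int = GROUP_SIZE) -> List[float]:
    """Reconstruct approximate weights from integer codes and group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]
```

A smaller group size means more scales per row, which tends to lower reconstruction error at the cost of extra storage for the scales.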

Hardware Used for Quantization

6x NVIDIA GeForce RTX 5090 (32GB each)
CUDA 12.8+
Sequential layer-by-layer processing to avoid out-of-memory (OOM) errors

Usage

With vLLM

from vllm import LLM, SamplingParams

# Initialize the model
# For multi-GPU inference, also pass tensor_parallel_size=<number of GPUs>
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)

With Transformers

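# Assumes the compressed-tensors package is installed, which Transformers
# uses to load checkpoints produced by llm-compressor
from transformers import AutoModelForCausalLM, AutoTokenizer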

model = AutoModelForCausalLM.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Inference Performance

This quantized model offers:

Roughly 3.5x smaller weight memory than FP16 (4-bit weights plus one scale per group of 32)
Faster inference on compatible hardware
Accuracy intended to stay close to the base model (GPTQ minimizes layer-wise quantization error on the calibration data)
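
The memory saving depends on the group size, because every group carries its own scale. The sketch below works through that arithmetic. It assumes one 16-bit scale per group and no stored zero point (as symmetric quantization allows), and it ignores layers left unquantized, so the figures are estimates rather than measurements of this checkpoint.

```python
def effective_bits_per_weight(bits: int = 4, group_size: int = 32,
                              scale_bits: int = 16,
                              zero_point_bits: int = 0) -> float:
    """Average storage per weight, including per-group metadata.

    scale_bits and zero_point_bits are assumptions about the storage
    format, not values read from this checkpoint.
    """
    return bits + (scale_bits + zero_point_bits) / group_size


def compression_vs_fp16(bits: int = 4, group_size: int = 32,
                        scale_bits: int = 16,
                        zero_point_bits: int = 0) -> float:
    """Ratio of FP16 weight storage to quantized weight storage."""
    return 16 / effective_bits_per_weight(bits, group_size,
                                          scale_bits, zero_point_bits)


for gs in (32, 64, 128):
    print(f"group_size={gs}: "
          f"{effective_bits_per_weight(group_size=gs):.3f} bits/weight, "
          f"{compression_vs_fp16(group_size=gs):.2f}x smaller than FP16")
```

Under these assumptions, group size 32 costs 4.5 bits per weight, so the reduction is somewhat below the nominal 4x of pure 4-bit storage.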

Recommended Hardware

NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer)
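At least 48GB VRAM for single-GPU inference (4-bit weights for a 72B model occupy roughly 36GB before group scales, KV cache, and activations); 24GB cards need a multi-GPU setup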
Multi-GPU setup for larger batch sizes
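
The hardware requirements can be sanity-checked with a rough estimate. The sketch below assumes about 72 billion parameters at roughly 4.5 bits per weight (4-bit codes plus one 16-bit scale per group of 32) and reserves a hypothetical 4 GiB per GPU for KV cache and activations. Both figures are assumptions to adjust for the actual workload.

```python
import math


def weight_memory_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


def min_gpus(weight_gib: float, vram_gib: float,
             reserve_gib: float = 4.0) -> int:
    """Smallest GPU count whose combined usable memory holds the weights.

    reserve_gib is an assumed per-GPU allowance for KV cache and
    activations; long contexts or large batches need more.
    """
    usable = vram_gib - reserve_gib
    if usable <= 0:
        raise ValueError("reserve_gib must be smaller than vram_gib")
    return math.ceil(weight_gib / usable)


weights = weight_memory_gib(72e9, 4.5)
print(f"Estimated weight memory: {weights:.1f} GiB")
for vram in (24, 32, 48, 80):
    print(f"{vram} GiB cards: at least {min_gpus(weights, vram)} GPU(s)")
```

In vLLM, a GPU count greater than one would be passed as `tensor_parallel_size` when constructing `LLM`.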

Model Details

Base Model: Kwaipilot/KAT-Dev-72B-Exp
Quantization Tool: llm-compressor
Compatible Inference Engines: vLLM, TGI (Text Generation Inference)

Limitations

Quantization may affect model accuracy on certain tasks
Requires vLLM or compatible inference engine for optimal performance

Acknowledgements

Base model: Kwaipilot/KAT-Dev-72B-Exp
Quantization: llm-compressor
Inference: vLLM
