# KAT-Dev-72B-Exp - GPTQ INT4 (group_size=32)
This is a GPTQ-quantized version of [Kwaipilot/KAT-Dev-72B-Exp](https://huggingface.co/Kwaipilot/KAT-Dev-72B-Exp).
## Quantization Details

- Method: GPTQ (accurate post-training weight quantization)
- Bits: 4
- Group Size: 32
- Quantization Type: INT
- Symmetric: True
- Calibration Samples: 128
- Calibration Dataset: allenai/c4
- Max Sequence Length: 512
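For reference, here is a hedged sketch of how these settings map onto an llm-compressor one-shot GPTQ run. This is not the exact script used to produce this checkpoint; argument names follow recent llm-compressor examples and may differ across versions, and the group-size override in particular should be checked against your installed release.

```python
# Illustrative llm-compressor one-shot GPTQ run matching the settings above.
# Assumption: recent llm-compressor API; details vary by version.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(
    targets="Linear",      # quantize all Linear layers
    ignore=["lm_head"],    # assumption: output head left unquantized, as is common
    scheme="W4A16",        # 4-bit symmetric INT weights, 16-bit activations
    # Note: the W4A16 preset defaults to group_size=128; reproducing
    # group_size=32 requires a custom quantization config.
)

oneshot(
    model="Kwaipilot/KAT-Dev-72B-Exp",
    dataset="allenai/c4",
    recipe=recipe,
    max_seq_length=512,
    num_calibration_samples=128,
)
```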
## Hardware Used for Quantization

- 6x NVIDIA GeForce RTX 5090 (32 GB each)
- CUDA 12.8+
- Sequential layer-by-layer processing, so each GPU only ever holds one layer at a time (OOM-safe); see the sketch below
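To make the layer-by-layer idea concrete, here is a conceptual sketch (not llm-compressor's actual internals; `quantize_layer` is a hypothetical stand-in for the per-layer GPTQ solve):

```python
import torch

def quantize_sequentially(layers, hidden, quantize_layer):
    """Conceptual OOM-safe GPTQ loop: one decoder layer on the GPU at a time.

    layers: list of torch.nn.Module decoder layers (resident on CPU)
    hidden: calibration activations feeding the first layer
    quantize_layer: hypothetical callable running the GPTQ solve for one layer
    """
    for layer in layers:
        layer.to("cuda")                   # only this layer occupies GPU memory
        with torch.no_grad():
            quantize_layer(layer, hidden)  # solve for 4-bit weights from activations
            hidden = layer(hidden)         # propagate calibration activations forward
            # (real decoder layers also need masks/position ids; elided here)
        layer.to("cpu")                    # offload before touching the next layer
        torch.cuda.empty_cache()
    return layers
```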
## Usage
### With vLLM

```python
from vllm import LLM, SamplingParams

# Initialize the model
llm = LLM(model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32", trust_remote_code=True)

# Create sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512,
)

# Generate text
prompts = ["Hello, how are you?"]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
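```

The 4-bit weights of a 72B model still occupy roughly 40 GB, so a single consumer GPU will generally not hold them. A minimal multi-GPU sketch, assuming two 48 GB-class GPUs (the GPU count and limits below are placeholders to adjust for your setup):

```python
from vllm import LLM

# Shard the quantized weights across 2 GPUs with tensor parallelism.
llm = LLM(
    model="Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True,
    tensor_parallel_size=2,       # number of GPUs to shard across
    gpu_memory_utilization=0.90,  # fraction of VRAM vLLM may reserve
    max_model_len=4096,           # cap context length to bound KV-cache memory
)
```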
### With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" (requires the accelerate package) spreads the model
# across available GPUs and CPU. Loading this checkpoint may also require
# the matching quantization backend to be installed (llm-compressor
# outputs typically load via the compressed-tensors integration).
model = AutoModelForCausalLM.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "Vykyan/KAT-Dev-72B-Exp-GPTQ-INT4-gs32",
    trust_remote_code=True,
)

# Generate
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
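For instruction-style prompting, chat models are usually driven through the tokenizer's chat template rather than raw strings. Continuing from the snippet above, a minimal sketch assuming the checkpoint ships a chat template (as the base model does):

```python
# Build a chat-formatted prompt via the tokenizer's built-in template,
# then generate as before.
messages = [
    {"role": "user", "content": "Write a Python function that reverses a string."},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant turn marker
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```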
## Inference Performance

This quantized model offers:
- Roughly 4x lower weight memory than FP16 (see the back-of-envelope estimate below)
- Faster inference on compatible hardware, since 4-bit weights reduce memory-bandwidth pressure
- Accuracy close to the FP16 baseline on most tasks; GPTQ calibrates against sample data to minimize per-layer quantization error
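A rough, illustrative estimate of the weight footprint (ignoring KV cache and activation memory; the 72B parameter count is approximate):

```python
# Back-of-envelope weight memory for a ~72B-parameter model.
params = 72e9

fp16_gb = params * 2 / 1e9  # 2 bytes per weight
# INT4 with group_size=32: 0.5 bytes per weight, plus one FP16 scale per
# 32-weight group (~0.0625 extra bytes/weight; zero-points omitted since
# the quantization is symmetric).
int4_gb = params * (0.5 + 2 / 32) / 1e9

print(f"FP16 weights: ~{fp16_gb:.0f} GB")        # ~144 GB
print(f"INT4 gs=32 weights: ~{int4_gb:.0f} GB")  # ~41 GB
```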
## Recommended Hardware

- NVIDIA GPUs with compute capability 7.5+ (RTX 20-series or newer); a quick check is shown below
- The 4-bit weights alone occupy roughly 40 GB, so plan for at least ~48 GB of VRAM on a single GPU, or shard across multiple GPUs
- Multi-GPU setups also help with longer contexts and larger batch sizes
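A quick way to confirm the compute-capability requirement on your machine (plain PyTorch, no extra dependencies):

```python
import torch

# Efficient INT4 kernels generally require compute capability 7.5 or newer.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU 0 compute capability: {major}.{minor}")
assert (major, minor) >= (7, 5), "GPU too old for efficient INT4 kernels"
```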
## Model Details
- Base Model: Kwaipilot/KAT-Dev-72B-Exp
- Quantization Tool: llm-compressor
- Compatible Inference Engines: vLLM, TGI (Text Generation Inference)
## Limitations

- Quantization may reduce accuracy on some tasks, particularly those sensitive to numerical precision
- Requires vLLM or a compatible inference engine for optimal performance
## Acknowledgements
- Base model: Kwaipilot/KAT-Dev-72B-Exp
- Quantization: llm-compressor
- Inference: vLLM