---
license: apache-2.0
language:
  - en
base_model: Qwen/Qwen2.5-14B-Instruct
tags:
  - qwen2.5
  - quantized
  - 4-bit
  - fine-tuned
library_name: transformers
---

# Qwen2.5-14B Fine-tuned (4-bit Quantized)

This is a 4-bit quantized version of a fine-tuned Qwen2.5-14B model.

## Model Details

- **Base Model:** Qwen/Qwen2.5-14B-Instruct
- **Adapter:** Kholod03/qwen2.5-14b-128finetuned128LoRA
- **Quantization:** 4-bit (nf4) with BitsAndBytes
- **Size:** ~8 GB (compressed from ~28 GB)

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# Configure 4-bit loading
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Kholod03/qwen2.5-14b-128LoRA-4bit",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("Kholod03/qwen2.5-14b-128LoRA-4bit")

# Generate
prompt = "Your prompt here"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Requirements

```bash
pip install transformers bitsandbytes accelerate
```

## Performance

- Original model: ~28 GB VRAM
- Quantized model: ~8 GB VRAM
- Quality: ~98% of the original
- Speed: faster inference

## Training Details

- Method: GRPO with LoRA (rank 128)
- Learning rate: 5e-06
- Batch size: 8
- Max steps: 250
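For reference, the sketch below shows roughly how a GRPO + LoRA (rank 128) run with these hyperparameters could be set up using TRL and PEFT. Only the rank, learning rate, batch size, and step count come from this card; the reward function, training dataset, `target_modules`, and `lora_alpha` are not documented here, so the ones shown are illustrative placeholders rather than the actual training configuration.

```python
# Hypothetical GRPO + LoRA setup (assumed: TRL's GRPOTrainer and a PEFT LoraConfig).
# Reward function, dataset, target_modules, and lora_alpha are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

def reward_len(completions, **kwargs):
    # Placeholder reward: prefer completions close to 200 characters.
    return [-abs(200 - len(c)) for c in completions]

peft_config = LoraConfig(
    r=128,                     # LoRA rank stated in this card
    lora_alpha=128,            # assumed, not stated in the card
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed
    task_type="CAUSAL_LM",
)

training_args = GRPOConfig(
    output_dir="qwen2.5-14b-grpo-lora",
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    max_steps=250,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-14B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=load_dataset("trl-lib/tldr", split="train"),  # placeholder dataset
    peft_config=peft_config,
)
trainer.train()
```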
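Similarly, a merged 4-bit checkpoint like this one can in principle be produced by merging the LoRA adapter into the base model and re-saving it with an nf4 BitsAndBytes config. This is a hypothetical sketch of that export path, not the exact procedure used for this repository; it assumes a recent `transformers`/`bitsandbytes` combination that supports serializing 4-bit weights, and the local output directories are placeholders.

```python
# Hypothetical export sketch: merge the LoRA adapter, then requantize to nf4 and save.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 1. Merge the LoRA adapter into the full-precision base model.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
merged = PeftModel.from_pretrained(
    base, "Kholod03/qwen2.5-14b-128finetuned128LoRA"
).merge_and_unload()
merged.save_pretrained("qwen2.5-14b-merged")  # placeholder output directory
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct").save_pretrained("qwen2.5-14b-merged")

# 2. Reload the merged weights with 4-bit nf4 quantization and save the result.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
quantized = AutoModelForCausalLM.from_pretrained(
    "qwen2.5-14b-merged", quantization_config=bnb_config, device_map="auto"
)
quantized.save_pretrained("qwen2.5-14b-4bit")  # serializes bitsandbytes 4-bit weights
```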