# Quantization Recipe
Install `uv` by following https://docs.astral.sh/uv/getting-started/installation/, then create the environment and install the pinned dependencies:
```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```
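A quick sanity check (optional, not part of the recipe) that the pinned packages import cleanly and a GPU is visible:

```python
# Optional environment check before training; not part of the original recipe.
import torch
import torchao
import transformers

print("transformers:", transformers.__version__)  # expect 4.56.2
print("torchao:", torchao.__version__)            # nightly build
print("CUDA available:", torch.cuda.is_available())
```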
## QAT Finetuning with PARQ
Activate the environment and launch the QAT SFT run. With these flags the effective global batch size is `ngpu × device_batch_size × grad_accum_steps` = 1 × 4 × 8 = 32:

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-4bit-tulu-finetune-${SEED}
ngpu=1
device_batch_size=4
grad_accum_steps=8
lr=2e-5

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
uv run "https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py" \
    --model_name_or_path Qwen/Qwen3-4B \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name allenai/tulu-3-sft-olmo-2-mixture-0225 \
    --dataset_sources ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k,allenai/tulu-3-sft-personas-math-grade-filtered \
    --dataloader_num_workers 4 \
    --save_steps 1500 \
    --save_total_limit 1 \
    --report_to tensorboard \
    --logging_steps 2 \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 4 \
    --embed_pat "(lm_head|embed_tokens)" \
    --embed_block_size 0
```
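For intuition about what `--weight_bits 4` asks of the model, the sketch below shows generic symmetric per-channel 4-bit fake quantization, i.e. rounding weights to a 16-level grid in the forward pass. This is an illustration only, not PARQ's actual optimizer-based method:

```python
# Illustrative only: generic symmetric per-channel 4-bit fake quantization.
# This is NOT the PARQ algorithm, just the weight rounding that QAT trains
# the model to tolerate.
import torch

def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    qmax = 2 ** (4 - 1) - 1                            # signed 4-bit range is [-8, 7]
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax  # one scale per output channel
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                   # dequantized weights for the forward pass

w = torch.randn(8, 16)
print((w - fake_quantize_4bit(w)).abs().max())         # worst-case rounding error
```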
## Inference-ready Model Conversion
Note: to `push_to_hub` you need to run

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

and log in with a token that has write access, created at https://huggingface.co/settings/tokens.
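Optionally, confirm the login took effect before running the conversion script (this uses standard `huggingface_hub` helpers):

```python
# Optional: verify a write-capable token is active before pushing.
from huggingface_hub import get_token, whoami

token = get_token()
assert token is not None, "No token found; run `huggingface-cli login` first."
print("Logged in as:", whoami(token=token)["name"])
```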
To get the quantized model, run the following from the root of `hf-scripts/`:
```python
import os

from huggingface_hub import whoami, get_token
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)

set_seed(0)

# Point this at the finetuning output directory ($SAVE_DIR from the
# previous step), either via the environment or by hardcoding the path.
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "John writes 20 pages a day. How long will it take him to write 3 books that are 400 pages each?"
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
# Drop token_type_ids if the tokenizer returned them; generate() does not accept them.
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```
The response from manual testing is:
```
Let's compute the total number of pages John has to write. There are 3 books, each with 400 pages. So the total number of pages is 3 * 400 = 1200 pages.
John writes 20 pages a day.
So the number of days it will take him to write 1200 pages is 1200 / 20 = 60 days.
Thus, it will take John \boxed{60} days to write 3 books.
```
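As a final check (optional; `"<username>/<model_name>"` below is a placeholder for the `save_to` repo id constructed by the script), you can load the pushed checkpoint back from the Hub:

```python
# Optional round-trip check: load the pushed model back from the Hub.
# "<username>/<model_name>" is a placeholder for the `save_to` repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<username>/<model_name>"
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```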