# Quantization Recipe
Install `uv` by following https://docs.astral.sh/uv/getting-started/installation/, then create the environment and install the pinned dependencies:
```bash
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
```
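A quick sanity check (optional, not part of the recipe) that the pinned packages import cleanly and a GPU is visible:

```python
# Optional environment check before training; not part of the original recipe.
import torch
import torchao
import transformers

print("transformers:", transformers.__version__)  # expect 4.56.2
print("torchao:", torchao.__version__)            # nightly build
print("CUDA available:", torch.cuda.is_available())
```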
## QAT Finetuning with PARQ
Activate the environment and launch the QAT SFT run. With these flags the effective global batch size is `ngpu × device_batch_size × grad_accum_steps` = 1 × 4 × 8 = 32:

```bash
source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-4bit-tulu-finetune-${SEED}
ngpu=1
device_batch_size=4
grad_accum_steps=8
lr=2e-5

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
uv run "https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py" \
    --model_name_or_path Qwen/Qwen3-4B \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name allenai/tulu-3-sft-olmo-2-mixture-0225 \
    --dataset_sources ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k,allenai/tulu-3-sft-personas-math-grade-filtered \
    --dataloader_num_workers 4 \
    --save_steps 1500 \
    --save_total_limit 1 \
    --report_to tensorboard \
    --logging_steps 2 \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --weight_bits 4 \
    --embed_pat "(lm_head|embed_tokens)" \
    --embed_block_size 0
```
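For intuition about what `--weight_bits 4` asks of the model, the sketch below shows generic symmetric per-channel 4-bit fake quantization, i.e. rounding weights to a 16-level grid in the forward pass. This is an illustration only, not PARQ's actual optimizer-based method:

```python
# Illustrative only: generic symmetric per-channel 4-bit fake quantization.
# This is NOT the PARQ algorithm, just the weight rounding that QAT trains
# the model to tolerate.
import torch

def fake_quantize_4bit(w: torch.Tensor) -> torch.Tensor:
    qmax = 2 ** (4 - 1) - 1                            # signed 4-bit range is [-8, 7]
    scale = w.abs().amax(dim=-1, keepdim=True) / qmax  # one scale per output channel
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q * scale                                   # dequantized weights for the forward pass

w = torch.randn(8, 16)
print((w - fake_quantize_4bit(w)).abs().max())         # worst-case rounding error
```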
## Inference-ready Model Conversion
Note: to `push_to_hub` you need to run

```bash
pip install -U "huggingface_hub[cli]"
huggingface-cli login
```

and log in with a token that has write access, created at https://huggingface.co/settings/tokens.
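Optionally, confirm the login took effect before running the conversion script (this uses standard `huggingface_hub` helpers):

```python
# Optional: verify a write-capable token is active before pushing.
from huggingface_hub import get_token, whoami

token = get_token()
assert token is not None, "No token found; run `huggingface-cli login` first."
print("Logged in as:", whoami(token=token)["name"])
```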
To get the quantized model, run the following from the root of `hf-scripts/`:
```python
import os

from huggingface_hub import whoami, get_token
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)

set_seed(0)

# Point this at the finetuning output directory ($SAVE_DIR from the
# previous step), either via the environment or by hardcoding the path.
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "John writes 20 pages a day. How long will it take him to write 3 books that are 400 pages each?"
messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
# Drop token_type_ids if the tokenizer returned them; generate() does not accept them.
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=256)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
```
The response from manual testing is:
```
Let's compute the total number of pages John has to write. There are 3 books, each with 400 pages. So the total number of pages is 3 * 400 = 1200 pages.
John writes 20 pages a day.
So the number of days it will take him to write 1200 pages is 1200 / 20 = 60 days.
Thus, it will take John \boxed{60} days to write 3 books.
```
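As a final check (optional; `"<username>/<model_name>"` below is a placeholder for the `save_to` repo id constructed by the script), you can load the pushed checkpoint back from the Hub:

```python
# Optional round-trip check: load the pushed model back from the Hub.
# "<username>/<model_name>" is a placeholder for the `save_to` repo id.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "<username>/<model_name>"
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
```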