Quantization Recipe
Install uv by following https://docs.astral.sh/uv/getting-started/installation/
uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao
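As an optional quick check that the environment is set up, something like the following should print the installed versions (the torchao version will vary since it comes from the nightly index):
python -c "import torch, torchao, transformers, trl; print(torch.__version__, torchao.__version__, transformers.__version__, trl.__version__)"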
QAT Finetuning with PARQ
We apply quantization-aware training (QAT) with PARQ, an optimizer-only QAT package in torchao. The model is finetuned on grade-school math data to maximize performance on gsm8k.
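As a rough mental model, optimizer-only QAT leaves the model and training loop unchanged and instead has the optimizer itself pull selected weights onto a low-bit grid. The toy sketch below illustrates the idea with a hard per-tensor projection after each AdamW step; it is not PARQ's actual API, which performs this projection in a far more principled, gradual way. In the real command below, --weight_bits, --linear_pat, and --embed_pat select which parameters are quantized.
# Illustrative toy only: snap weights to a uniform low-bit grid after each
# update, mimicking the "quantization lives in the optimizer" idea.
# (A real setup would restrict this to selected weight matrices.)
import torch

class ToyQuantAdamW(torch.optim.AdamW):
    def __init__(self, params, num_levels=4, **kwargs):  # 4 levels ~= 2-bit weights
        super().__init__(params, **kwargs)
        self.num_levels = num_levels

    def step(self, closure=None):
        loss = super().step(closure)  # ordinary AdamW update
        with torch.no_grad():
            half = self.num_levels // 2
            for group in self.param_groups:
                for p in group["params"]:
                    # Per-tensor uniform grid: round to the nearest level, then rescale.
                    scale = p.detach().abs().max() / half + 1e-12
                    p.copy_((p / scale).round().clamp_(-half, half - 1) * scale)
        return loss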
The training command is provided below for reproducibility. Note that the model is initialized from a 2-bit model, lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-hf. To optimize for other tasks, replace --dataset_name with a custom finetuning dataset and remove --dataset_sources.
source ~/.uv-hf/bin/activate
SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-gsm-finetune-${SEED}
ngpu=1
device_batch_size=4
grad_accum_steps=8
lr=8e-5
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
uv run "https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py" \
--model_name_or_path lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-hf \
--bf16 True \
--num_train_epochs 1 \
--per_device_train_batch_size $device_batch_size \
--gradient_accumulation_steps $grad_accum_steps \
--dataset_name allenai/tulu-3-sft-olmo-2-mixture-0225 \
--dataset_sources ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k,allenai/tulu-3-sft-personas-math-grade-filtered \
--dataloader_num_workers 4 \
--save_steps 1500 \
--save_total_limit 1 \
--report_to tensorboard \
--logging_steps 2 \
--learning_rate $lr \
--lr_scheduler_type linear \
--warmup_ratio 0.0 \
--seed $SEED \
--output_dir $SAVE_DIR \
--enable_thinking \
--weight_bits 2 \
--linear_pat 'proj\.weight$' \
--embed_pat '(lm_head|embed_tokens)'
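Training logs go to TensorBoard via --report_to tensorboard and are written under the output directory by default, so progress can be monitored with:
tensorboard --logdir $SAVE_DIR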
Generation from Quantized Model
Note: to use push_to_hub, you first need to run
pip install -U "huggingface_hub[cli]"
huggingface-cli login
and log in with a token that has write access, created at https://huggingface.co/settings/tokens.
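Alternatively, the same login can be done from Python using huggingface_hub's standard helper:
from huggingface_hub import login

login()  # prompts for the write-access token and caches it locally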
To get the quantized model, run the following from the root of hf-scripts/:
import os

from huggingface_hub import whoami, get_token
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    set_seed,
)

set_seed(0)
model_path = os.environ["SAVE_DIR"]  # the training output directory from above
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "John writes 20 pages a day. How long will it take him to write 3 books that are 400 pages each?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=1024)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
The response from manual testing is:
<think>
</think>
Since John writes 20 pages a day, we need to first calculate the total number of pages he needs to write.
The total number of pages is 3 books * 400 pages/book = 1200 pages.
Since John writes 20 pages a day, the number of days it will take him to write 1200 pages is 1200 pages / 20 pages/day = 60 days.
Thus, it will take John \boxed{60} days to write the 3 books.
Model Quality
The model scores 67.93 on gsm8k (flexible-extract) when evaluated with the command below.
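lm-eval is the EleutherAI lm-evaluation-harness CLI; if it is not already present in the active environment, it can be installed with:
uv pip install lm-eval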
lm-eval \
--model hf \
--model_args pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-gsm,dtype=auto \
--gen_kwargs max_new_tokens=1024 \
--apply_chat_template \
--fewshot_as_multiturn \
--num_fewshot 5 \
--tasks gsm8k \
--batch_size auto \
--trust_remote_code
Exporting to ExecuTorch
⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.
We can run the quantized model on a mobile phone using ExecuTorch. Once ExecuTorch is set up, exporting and running the model on device is a breeze.
To set up ExecuTorch, run the following commands:
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd
Next, install the latest version of torchao:
git clone https://github.com/pytorch/ao.git
pushd ao
pip install .
popd
(The above command installs the right kernels on an Arm-based Mac. On Arm-based Linux, define the following environment variables before running pip install: BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP.)
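For example, on an Arm-based Linux machine the torchao install step above becomes:
BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP pip install .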
ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those used by Hugging Face. The following script converts the Hugging Face checkpoint key names to the ones ExecuTorch expects:
python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
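Optionally, you can sanity-check the converted checkpoint before exporting; this assumes pytorch_model_converted.bin is a plain PyTorch state dict:
import torch

# Peek at the converted checkpoint and a few of its ExecuTorch-style key names.
state_dict = torch.load("pytorch_model_converted.bin", map_location="cpu")
print(len(state_dict), "tensors")
print(list(state_dict)[:5])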
Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 using the torchao lowbit kernels as follows. To export, we must be on an Arm-based Mac or Linux machine.
(Note: the ExecuTorch LLM export script requires config.json to have certain key names. The correct config to use is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)
python -m executorch.examples.models.llama.export_llama \
--model "qwen3_4b" \
--checkpoint pytorch_model_converted.bin \
--params examples/models/qwen3/config/4b_config.json \
--output_name model.pte \
-kv \
--use_sdpa_with_kv_cache \
--use-torchao-kernels \
--max_context_length 1024 \
--max_seq_length 1024 \
--dtype fp32 \
--metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'
After that you can run the model in a mobile app (see Running in a mobile app).
(We try to keep these instructions up-to-date, but if you find they do not work, check out our CI test in ExecuTorch for the latest source of truth, and let us know we need to update our model card.)