Quantization Recipe

Install uv by following https://docs.astral.sh/uv/getting-started/installation/

uv venv ~/.uv-hf --python 3.13
source ~/.uv-hf/bin/activate
uv pip install transformers==4.56.2 'trl[vllm]==0.23.1' tensorboard
uv pip install --pre --index-url https://download.pytorch.org/whl/nightly/cu126 torchao

QAT Finetuning with PARQ

We apply QAT with PARQ, an optimizer-only quantization package in torchao. The model is finetuned on grade-school math data to maximize performance on gsm8k.
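Conceptually, PARQ applies quantization as a projection inside the optimizer: the latent full-precision weights are gradually snapped onto a small grid of values, so the forward pass sees (near-)quantized weights while gradients keep updating the latent copies. The toy sketch below shows only a hard projection onto a 4-level (2-bit) per-tensor grid; it is illustrative and not the PARQ implementation, which uses per-group scales and an annealed proximal schedule.

import torch

def project_to_2bit(w: torch.Tensor) -> torch.Tensor:
    """Snap a tensor onto 4 evenly spaced levels (a per-tensor 2-bit grid)."""
    n_levels = 4
    w_min, w_max = w.min(), w.max()
    step = (w_max - w_min) / (n_levels - 1)
    q = torch.round((w - w_min) / step).clamp(0, n_levels - 1)
    return w_min + q * step

w = torch.randn(8, 8)
w_q = project_to_2bit(w)
print(torch.unique(w_q))  # at most 4 distinct values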

The training command is provided below for reproducibility. Note that the model is initialized from a 2-bit model, lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-hf. To optimize for other tasks, replace --dataset_name with a custom finetuning dataset and remove --dataset_sources.

source ~/.uv-hf/bin/activate

SEED=$RANDOM
SAVE_DIR=checkpoints/qwen3-2bit-gsm-finetune-${SEED}

ngpu=1
device_batch_size=4
grad_accum_steps=8
lr=8e-5
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
    uv run "https://huggingface.co/datasets/lvj/parq-sft/resolve/main/qat_sft.py" \
    --model_name_or_path lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-hf \
    --bf16 True \
    --num_train_epochs 1 \
    --per_device_train_batch_size $device_batch_size \
    --gradient_accumulation_steps $grad_accum_steps \
    --dataset_name allenai/tulu-3-sft-olmo-2-mixture-0225 \
    --dataset_sources ai2-adapt-dev/tulu_v3.9_open_math_2_gsm8k_50k,allenai/tulu-3-sft-personas-math-grade-filtered \
    --dataloader_num_workers 4 \
    --save_steps 1500 \
    --save_total_limit 1 \
    --report_to tensorboard \
    --logging_steps 2 \
    --learning_rate $lr \
    --lr_scheduler_type linear \
    --warmup_ratio 0.0 \
    --seed $SEED \
    --output_dir $SAVE_DIR \
    --enable_thinking \
    --weight_bits 2 \
    --linear_pat 'proj\.weight$' \
    --embed_pat '(lm_head|embed_tokens)'
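The --linear_pat and --embed_pat flags are regular expressions that pick which parameters receive 2-bit weight quantization and which receive the 4-bit embedding treatment. The sketch below only illustrates how such patterns partition a model's parameters; the training script's actual grouping logic may differ.

import re

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B", dtype="auto")

linear_pat = re.compile(r"proj\.weight$")          # 2-bit linear weights
embed_pat = re.compile(r"(lm_head|embed_tokens)")  # 4-bit embeddings / output head

linear_names, embed_names, other_names = [], [], []
for name, _ in model.named_parameters():
    if linear_pat.search(name):
        linear_names.append(name)
    elif embed_pat.search(name):
        embed_names.append(name)
    else:
        other_names.append(name)

print(len(linear_names), "linear weights,", len(embed_names), "embedding weights")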

Generation from Quantized Model

Note: to use push_to_hub you need to run

pip install -U "huggingface_hub[cli]"
huggingface-cli login

and use a token with write access, from https://huggingface.co/settings/tokens
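If you prefer a non-interactive setup, huggingface_hub also exposes a programmatic login; the token below is a placeholder.

from huggingface_hub import login

# Programmatic alternative to `huggingface-cli login` (placeholder token shown)
login(token="hf_your_write_token_here")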

To load the quantized model, generate from it, and push it to the Hub, run the following from the root of hf-scripts/:

import os

from huggingface_hub import whoami, get_token
from transformers import (
  AutoModelForCausalLM,
  AutoTokenizer,
  set_seed,
)

set_seed(0)
# Checkpoint from the finetuning step; set SAVE_DIR in the environment or paste the path here
model_path = os.environ["SAVE_DIR"]
model = AutoModelForCausalLM.from_pretrained(
    model_path, device_map="auto", dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Manual testing
prompt = "John writes 20 pages a day. How long will it take him to write 3 books that are 400 pages each?"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(templated_prompt, return_tensors="pt").to(model.device)
inputs.pop("token_type_ids", None)  # generate() does not accept token_type_ids

# Decode only the newly generated tokens, not the prompt
start_idx = len(inputs.input_ids[0])
response_ids = model.generate(**inputs, max_new_tokens=1024)[0]
response_ids = response_ids[start_idx:].tolist()
output_text = tokenizer.decode(response_ids, skip_special_tokens=True)
print(output_text)

# Push to hub
token = get_token()
username = whoami(token=token)["name"]
model_name = os.path.basename(model_path)
save_to = os.path.join(username, model_name)
model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)

The response from manual testing is:

<think>

</think>

Since John writes 20 pages a day, we need to first calculate the total number of pages he needs to write.
The total number of pages is 3 books * 400 pages/book = 1200 pages.
Since John writes 20 pages a day, the number of days it will take him to write 1200 pages is 1200 pages / 20 pages/day = 60 days.
Thus, it will take John \boxed{60} days to write the 3 books.
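After pushing, you can sanity-check the upload by loading the checkpoint back from the Hub; the repo id below is a placeholder for the save_to value used above.

from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/qwen3-2bit-gsm-finetune-12345"  # placeholder for save_to
model = AutoModelForCausalLM.from_pretrained(repo_id, device_map="auto", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(repo_id)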

Model Quality

The model scores 67.93 on gsm8k (flexible-extract) with the command below (lm-eval is installed separately, e.g. pip install lm_eval).

lm-eval \
  --model hf \
  --model_args pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-gsm,dtype=auto \
  --gen_kwargs max_new_tokens=1024 \
  --apply_chat_template \
  --fewshot_as_multiturn \
  --num_fewshot 5 \
  --tasks gsm8k \
  --batch_size auto \
  --trust_remote_code
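The same evaluation can also be run from Python. The sketch below mirrors the CLI flags via lm_eval.simple_evaluate; the keyword names are assumed to match the CLI and can vary between lm-eval versions, so treat it as an approximation.

import lm_eval

# Rough Python-API equivalent of the lm-eval command above (argument names assumed
# to mirror the CLI flags; verify against your installed lm-eval version).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared-gsm,dtype=auto",
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
    gen_kwargs="max_new_tokens=1024",
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(results["results"]["gsm8k"])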

Exporting to ExecuTorch

⚠️ Note: These instructions only work on Arm-based machines. Running them on x86_64 will fail.

We can run the quantized model on a mobile phone using ExecuTorch. Once ExecuTorch is set up, exporting and running the model on device is a breeze.

To set up ExecuTorch, run the following commands:

git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd

Next, install the latest version of torchao from source:

git clone https://github.com/pytorch/ao.git
pushd ao
pip install .
popd

(The above command installs the right kernels on an Arm-based Mac. On Arm-based Linux, define the following environment variables before running pip install: BUILD_TORCHAO_EXPERIMENTAL=1 TORCHAO_BUILD_CPU_AARCH64=1 TORCHAO_BUILD_KLEIDIAI=1 TORCHAO_ENABLE_ARM_NEON_DOT=1 TORCHAO_PARALLEL_BACKEND=OPENMP.)

ExecuTorch's LLM export scripts require the checkpoint keys and parameters to have certain names, which differ from those used by Hugging Face, so we first convert the Hugging Face checkpoint key names to the ones ExecuTorch expects. The following command does this for you.

python -m executorch.examples.models.qwen3.convert_weights $(hf download lvj/Qwen3-4B-parq-2b-weight-4b-embed-shared) pytorch_model_converted.bin
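Before exporting, you can optionally confirm that the converted checkpoint loads and inspect a few of the renamed keys (a quick sketch):

import torch

state_dict = torch.load("pytorch_model_converted.bin", map_location="cpu")
print(f"{len(state_dict)} tensors")
for name in list(state_dict)[:5]:
    print(name, tuple(state_dict[name].shape))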

Once we have the checkpoint, we export it to ExecuTorch with a max_seq_length/max_context_length of 1024 using the torchao lowbit kernels, as follows. To export, we must be on an Arm-based Mac or Linux machine.

(Note: the ExecuTorch LLM export script requires that config.json use certain key names. The correct config to use for the LLM export script is located at examples/models/qwen3/config/4b_config.json within the ExecuTorch repo.)

python -m executorch.examples.models.llama.export_llama \
  --model "qwen3_4b" \
  --checkpoint pytorch_model_converted.bin \
  --params examples/models/qwen3/config/4b_config.json \
  --output_name model.pte \
  -kv \
  --use_sdpa_with_kv_cache \
  --use-torchao-kernels \
  --max_context_length 1024 \
  --max_seq_length 1024 \
  --dtype fp32 \
  --metadata '{"get_bos_id":151644, "get_eos_ids":[151643, 151645]}'

After that you can run the model in a mobile app (see Running in a mobile app).

(We try to keep these instructions up-to-date, but if you find they do not work, check out our CI test in ExecuTorch for the latest source of truth, and let us know we need to update our model card.)
