---
library_name: transformers
pipeline_tag: text-generation
license: mit
tags:
  - quantization
  - sparsity
  - llm
  - qwen2
---

# Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs

This repository contains a compressed version of the `Qwen/Qwen2.5-7B-Instruct` model, produced with the **Optimal Brain Restoration (OBR)** framework presented in the paper [Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs](https://huggingface.co/papers/2509.11177). OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, delivering significant speedup and memory reduction.

**Code Repository:** [https://github.com/csguoh/OBR](https://github.com/csguoh/OBR)

## Paper Abstract

Recent advances in Large Language Model (LLM) compression, such as quantization and pruning, have achieved notable success. However, as these techniques gradually approach their respective limits, relying on a single method for further compression has become increasingly challenging. In this work, we explore an alternative solution by combining quantization and sparsity. This joint approach, though promising, introduces new difficulties due to the inherently conflicting requirements on weight distributions: quantization favors compact ranges, while pruning benefits from high variance. To attack this problem, we propose Optimal Brain Restoration (OBR), a general and training-free framework that aligns pruning and quantization by error compensation between both. OBR minimizes performance degradation on downstream tasks by building on a second-order Hessian objective, which is then reformulated into a tractable problem through surrogate approximation and ultimately reaches a closed-form solution via group error compensation. Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, and delivers up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline.

## Model Details

### Model Description

This model is a 4-bit quantized, 50% unstructured sparse version of `Qwen/Qwen2.5-7B-Instruct`. It leverages the Optimal Brain Restoration (OBR) framework, a training-free method that aligns pruning and quantization by error compensation. OBR aims to minimize performance degradation on downstream tasks by using a second-order Hessian objective, reformulated into a tractable problem via surrogate approximation and group error compensation. This specific model instance uses the FlatQuant rotation scheme.

-   **Developed by:** Hang Guo, Yawei Li, Luca Benini
-   **Model type:** Qwen2ForCausalLM (Text Generation)
-   **Language(s) (NLP):** English (primary for evaluation benchmarks)
-   **License:** MIT (see detailed explanation in the License section below)
-   **Finetuned from model:** `Qwen/Qwen2.5-7B-Instruct`

### Model Sources

-   **Repository:** [https://github.com/csguoh/OBR](https://github.com/csguoh/OBR)
-   **Paper:** [https://huggingface.co/papers/2509.11177](https://huggingface.co/papers/2509.11177)
-   **Hugging Face Collection (for OBR models):** [https://huggingface.co/collections/HangGuo/optimal-brain-resotration-689863c8687d3aeed27f9a96](https://huggingface.co/collections/HangGuo/optimal-brain-resotration-689863c8687d3aeed27f9a96)

## Uses

### Direct Use

This model is intended for fast and memory-efficient text generation tasks where a standard `Qwen2.5-7B-Instruct` model would typically be used, but with significantly reduced computational overhead and memory footprint. It is particularly suitable for environments with limited memory or computational resources, or for deploying LLMs at scale.

### Out-of-Scope Use

As a compressed model, while it aims to retain performance, aggressive compression levels might lead to subtle degradation in certain niche tasks or highly sensitive applications. Users should evaluate its performance for their specific use cases. The base model's limitations (e.g., potential biases, factual inaccuracies) also apply.

## Bias, Risks, and Limitations

The base model, `Qwen/Qwen2.5-7B-Instruct`, may carry inherent biases from its training data. Compression techniques such as quantization and sparsification, even when carefully applied with OBR, may also introduce minor performance fluctuations compared to the full-precision, dense model.

### Recommendations

Users should be aware of the trade-offs between model size/speed and potential minor performance shifts introduced by compression. Thorough evaluation on target tasks and datasets is recommended to ensure the model meets specific requirements.

## How to Get Started with the Model

This model is compatible with the Hugging Face `transformers` library. To get started, you can load the model using `AutoModelForCausalLM` and `AutoTokenizer`.

**IMPORTANT:** For compatibility with Qwen2.5 series models, ensure your `transformers` version is 4.45.0 or newer. You can install a suitable version via `pip install "transformers>=4.45.0"`.

Here's a basic example for text generation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

# This model ID is part of the OBR Hugging Face collection, specifically for FlatQuant on Qwen2.5
model_id = "HangGuo/QWen2.5-7B-FlatQuant-OBR-GPTQ-W4A4KV4S50"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16, # Use bfloat16 as indicated in config.json
    device_map="auto",
    trust_remote_code=True
)
model.eval()

# Load generation configuration from the model's own generation_config.json
generation_config = GenerationConfig.from_pretrained(model_id)

# Example prompt for an instruct model using Qwen's chat template
messages = [
    {"role": "system", "content": "You are Qwen, created by Alibaba Cloud. You are a helpful assistant."},
    {"role": "user", "content": "Explain the concept of quantum entanglement in simple terms."}
]
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(
        input_ids,
        generation_config=generation_config,
        max_new_tokens=512, # You can override defaults from generation_config.json if needed
        # Other generation parameters can be passed here or set in generation_config
    )

generated_text = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
print(f"Prompt: {messages[-1]['content']}
Generated: {generated_text}")
```

For more detailed usage, including how to apply OBR to other base models (like Llama2, Mixtral) and integrate it into evaluation pipelines, please refer to the [official GitHub repository's "Get Started" section](https://github.com/csguoh/OBR#get_started).

## Training Details

### Training Data

The base model, `Qwen/Qwen2.5-7B-Instruct`, was trained on various datasets. The Optimal Brain Restoration (OBR) method is a post-training compression technique and does not involve additional training data for the compression process itself. However, it relies on a small calibration dataset (e.g., WikiText) to restore performance and optimize quantization/sparsity.
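
As a rough illustration, the sketch below shows how such a WikiText calibration set is commonly assembled for post-training compression. The sample count (128) and sequence length (2048) are assumptions for illustration only; the exact calibration recipe used for this checkpoint is defined in the OBR GitHub repository.

```python
# Illustrative sketch only: assembling a WikiText-2 calibration set
# (128 samples x 2048 tokens is an assumed, commonly used setup).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Concatenate the raw training split and slice it into fixed-length blocks.
raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
text = "\n\n".join(raw["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

num_samples, seq_len = 128, 2048  # assumed calibration size
calib_batches = [
    ids[i * seq_len : (i + 1) * seq_len].unsqueeze(0)
    for i in range(num_samples)
]
print(len(calib_batches), calib_batches[0].shape)  # 128 blocks of shape [1, 2048]
```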

### Training Procedure

The OBR framework is training-free. The procedure involves applying pruning and quantization and then compensating for the induced errors using a second-order Hessian objective. Specific parameters for quantization (W4A4KV4) and sparsity (50%) are detailed in the paper and the GitHub repository.
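
OBR's closed-form group error compensation is derived in the paper; the toy sketch below is not that derivation. It only illustrates the underlying second-order (OBS/GPTQ-style) idea that OBR builds on: when one weight is pruned or quantized, the remaining weights are shifted along an inverse-Hessian direction so that the layer-output error induced by the perturbation is largely cancelled.

```python
# Toy second-order compensation sketch (OBS/GPTQ-style), for intuition only;
# OBR's actual closed-form operates on groups under joint quantization + sparsity.
import torch

torch.manual_seed(0)
d = 8
X = torch.randn(64, d)                       # calibration activations for one layer
H = X.T @ X + 1e-2 * torch.eye(d)            # damped layer-wise Hessian proxy
H_inv = torch.linalg.inv(H)

w = torch.randn(d)                           # one output row of the weight matrix
q = 3                                        # column being pruned (or quantized)
target = 0.0                                 # pruning; for quantization use quant(w[q])
err = w[q] - target

# Second-order update: spread the induced error onto the other weights along the
# inverse-Hessian direction; w_new[q] lands exactly on `target`.
w_new = w - err / H_inv[q, q] * H_inv[:, q]

def quad(dw):                                # second-order proxy for the output error
    return 0.5 * dw @ H @ dw

naive = w.clone()
naive[q] = target                            # zero the weight with no compensation
print(quad(naive - w).item(), quad(w_new - w).item())  # compensated error <= naive
```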

## Evaluation

### Testing Data, Factors & Metrics

Evaluation was conducted on standard benchmarks for LLMs, including WikiText perplexity and various zero-shot accuracy tasks. The paper provides detailed quantitative results and efficiency comparisons (runtime, FLOPs, TOPS).
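
As a rough illustration, the sketch below scores the released checkpoint on WikiText-2 perplexity with a simple non-overlapping 2048-token window protocol; the window size and scoring protocol are assumptions, and the paper's exact evaluation setup (and its zero-shot harness) may differ. See the GitHub repository for the reference evaluation pipeline.

```python
# Illustrative WikiText-2 perplexity sketch (assumed 2048-token windows).
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HangGuo/QWen2.5-7B-FlatQuant-OBR-GPTQ-W4A4KV4S50"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
model.eval()

# Tokenize the test split as one long stream, then score it window by window.
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids
seq_len = 2048
nlls = []
for i in range(ids.size(1) // seq_len):
    chunk = ids[:, i * seq_len : (i + 1) * seq_len].to(model.device)
    with torch.no_grad():
        out = model(chunk, labels=chunk)     # labels are shifted internally
    nlls.append(out.loss * chunk.size(1))
ppl = torch.exp(torch.stack(nlls).sum() / (len(nlls) * seq_len))
print(f"WikiText-2 perplexity: {ppl.item():.2f}")
```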

### Results

Experiments show that OBR enables aggressive W4A4KV4 quantization with 50% sparsity on existing LLMs, delivering up to 4.72x speedup and 6.4x memory reduction compared to the FP16-dense baseline, while maintaining strong performance on downstream tasks.

## Citation

If you find our work useful or helpful for your research, please feel free to cite our paper:

```bibtex
@article{guo2025optimal,
      title={Optimal Brain Restoration for Joint Quantization and Sparsification of LLMs}, 
      author={Hang Guo and Yawei Li and Luca Benini},
      year={2025},
      journal={arXiv preprint arXiv:2509.11177},
      eprint={2509.11177},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={http://arxiv.org/abs/2509.11177},
}
```

## License

This work is based on previous works including [QuaRot](https://github.com/spcl/QuaRot), [SpinQuant](https://github.com/facebookresearch/SpinQuant), and [FlatQuant](https://github.com/ruikangliu/FlatQuant). Users should follow the license of the corresponding backbone models.

For this specific model (`HangGuo/QWen2.5-7B-FlatQuant-OBR-GPTQ-W4A4KV4S50`), which is compressed using the FlatQuant rotation scheme, please refer to the FlatQuant GitHub repository for full license details.

## Model Card Contact

For any questions, feel free to contact [email protected].