---
license: apache-2.0
language:
- ar
- en
pipeline_tag: text-generation
tags:
- pytorch
library_name: transformers
base_model: QCRI/Fanar-1-9B
---

<p align="center">
  <img src="./fanar_logo.jpg" width="200"/>
</p>

# Fanar-1-9B-Instruct

**Fanar-1-9B-Instruct** is a powerful Arabic-English LLM developed by [Qatar Computing Research Institute (QCRI)](https://www.hbku.edu.qa/en/qcri) at [Hamad Bin Khalifa University (HBKU)](https://www.hbku.edu.qa/), a member of Qatar Foundation for Education, Science, and Community Development. It is the instruction-tuned version of [Fanar-1-9B](https://huggingface.co/QCRI/Fanar-1-9B), which we obtained by continually pretraining the `google/gemma-2-9b` model on 1T Arabic and English tokens. We pay particular attention to the richness of the Arabic language by supporting Modern Standard Arabic (MSA) and a diverse set of Arabic dialects, including Gulf, Levantine, and Egyptian. Through meticulous curation of the pretraining and instruction-tuning data, Fanar models are aligned with Islamic values and Arab cultures.

**Fanar-1-9B-Instruct** is a core component of the [Fanar GenAI platform](https://fanar.qa/), which offers a suite of capabilities including image generation, video and image understanding, deep thinking, advanced text-to-speech (TTS) and automatic speech recognition (ASR), attribution and fact-checking, Islamic RAG, and several other features.

We have published a comprehensive [report](https://arxiv.org/pdf/2501.13944) with all the details regarding our Fanar GenAI platform. We also provide API access to our models and the GenAI platform (request access [here](https://api.fanar.qa/request/en)).

---

## Model Details

| Attribute                  | Value                              |
|---------------------------|------------------------------------|
| Developed by              | [QCRI](https://www.hbku.edu.qa/en/qcri) at [HBKU](https://www.hbku.edu.qa/)                      |
| Sponsored by              | [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) |
| Model Type                | Autoregressive Transformer         |
| Parameter Count           | 8.7 Billion                          |
| Context Length            | 4096 Tokens                        |
| Input                     | Text only                          |
| Output                    | Text only                          |
| Training Framework        | [LitGPT](https://github.com/Lightning-AI/litgpt) |
| Pretraining Token Count   | 1 Trillion (ar + en) |
| SFT Instructions          | 4.5M                               |
| DPO Preference Pairs      | 250K                               |
| Languages                 | Arabic, English                    |
| License                   | Apache 2.0                         |

<!-- | Precision                 | bfloat16                           | -->
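
The 4,096-token context length listed above matters in practice: prompts that exceed it must be truncated (or chunked) before generation. The snippet below is a minimal sketch using the tokenizer's built-in truncation; the `MAX_NEW_TOKENS` budget and the placeholder document are illustrative, not part of this model card.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("QCRI/Fanar-1-9B-Instruct")

long_document = "example text " * 5000  # placeholder for text that exceeds the context window

# Reserve room for generated tokens inside the 4,096-token context window.
MAX_NEW_TOKENS = 256
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=4096 - MAX_NEW_TOKENS,
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # at most (1, 3840)
```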

---

## Model Training

#### Pretraining
Fanar-1-9B-Instruct was continually pretrained on 1T tokens, with a balanced focus on Arabic and English: ~515B English tokens from a carefully curated subset of the [Dolma](https://huggingface.co/datasets/allenai/dolma) dataset, 410B Arabic tokens that we collected, parsed, and filtered from a variety of sources, and 102B code tokens curated from [The Stack](https://github.com/bigcode-project/the-stack-v2) dataset. Our codebase used the [LitGPT](https://github.com/Lightning-AI/litgpt) framework.

#### Post-training
Fanar-1-9B-Instruct underwent a two-phase post-training pipeline:

| Phase | Size |
|-------|------|
| Supervised Fine-tuning (SFT) | 4.5M Instructions |
| Direct Preference Optimization (DPO) | 250K Preference Pairs |
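
For reference, the DPO phase optimizes a pairwise preference objective over (chosen, rejected) response pairs. The sketch below is a minimal PyTorch illustration of that loss, not our training code; the `beta` value and the per-sequence log-probability inputs are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss over per-sequence log-probabilities.

    Each input is a 1-D tensor of response log-probabilities (summed over
    tokens) under the policy being trained or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratios of policy vs. reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Encourage the chosen response to out-score the rejected one
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random (negative) log-probabilities for a batch of 4 pairs
loss = dpo_loss(-torch.rand(4) * 5, -torch.rand(4) * 5,
                -torch.rand(4) * 5, -torch.rand(4) * 5)
print(loss.item())
```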

---


## Getting Started

Fanar-1-9B-Instruct is compatible with the Hugging Face `transformers` library (≥ v4.40.0). Here's how to load and use the model:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "QCRI/Fanar-1-9B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

# message content may be in Arabic or English
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},  # "What is the capital of Qatar?"
]

# Build the chat-formatted prompt, then tokenize it and move it to the model's device
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt", return_token_type_ids=False).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
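
For interactive use you can also stream tokens to stdout as they are generated with `transformers`' `TextStreamer`. This is an optional addition to the snippet above and reuses the same `model`, `tokenizer`, and `inputs`:

```python
from transformers import TextStreamer

# Print tokens as they are generated, skipping the echoed prompt
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = model.generate(**inputs, max_new_tokens=256, streamer=streamer)
```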

Inference with [vLLM](https://github.com/vllm-project/vllm) is also supported:

```python
from vllm import LLM, SamplingParams

model_name = "QCRI/Fanar-1-9B-Instruct"

llm = LLM(model=model_name)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

# message content may be in Arabic or English
messages = [
    {"role": "user", "content": "ما هي عاصمة قطر؟"},
]

outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```
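
vLLM can also expose an OpenAI-compatible endpoint (for example via `vllm serve QCRI/Fanar-1-9B-Instruct`), which any OpenAI client can then query. The following is a hedged sketch, assuming a server is already running on the default local port:

```python
from openai import OpenAI

# Point the client at the locally running vLLM server (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="QCRI/Fanar-1-9B-Instruct",
    messages=[{"role": "user", "content": "ما هي عاصمة قطر؟"}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)
```
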
---

## Intended Use

Fanar-1-9B-Instruct is built for:

- Conversational agents (Arabic only or bilingual)
- Cultural and dialectal question answering in Arabic
- Educational, governmental, and civic NLP applications focused on the Arab world or Arabic-speaking audiences
- Research on Arabic natural language generation and understanding

Fanar-1-9B-Instruct can be deployed as part of a broader AI system. Developers are encouraged to implement proper safeguards to ensure culturally respectful, accurate, and safe deployment. It should not be used to generate or spread **harmful, illegal, or misleading content.**

A version of this model can be accessed through [Fanar Chat](https://chat.fanar.qa). We are continuously improving Fanar's models and capabilities, so responses there may differ from what you get from Fanar-1-9B-Instruct.

--- 

## Ethical Considerations & Limitations

Fanar-1-9B-Instruct is capable of generating fluent and contextually appropriate responses. However, as with any generative model, there is inherent uncertainty: the model may produce **biased, offensive, or incorrect outputs**, and it is **not suitable for high-stakes decision-making** (e.g., legal, medical, or financial advice). Although we have extensively tested Fanar-1-9B-Instruct and attempted to mitigate these issues, we cannot address every possible scenario. We therefore advise developers to implement safety checks and perform domain-specific fine-tuning for sensitive use cases. Kindly refer to our [Terms of Service](https://chat.fanar.qa/terms-of-service) and [Privacy Policy](https://chat.fanar.qa/privacy-policy).

Output generated by this model does not represent the views of QCRI, HBKU, Qatar Foundation, MCIT, or any other organization or individual.
 
---

## Evaluation

Evaluation was conducted using a modified version of the [LM Evaluation Harness](https://github.com/EleutherAI/lm-evaluation-harness) and internal cultural alignment benchmarks. Bold values mark the best score in each column.

<div style="overflow-x: auto;">

| Model | MMLU (5-shot) | MMMLU (Arabic) (0-shot) | ArabicMMLU (3-shot) | HellaSwag (0-shot) | PIQA (0-shot) | ARC Challenge (0-shot) | Belebele (Arabic) (3-shot) | ACVA (5-shot) | GSM8k | OALL (0-shot) | OALL v2 (0-shot) | Almieyar Arabic (3-shot) | Arab Cultural MCQ (3-shot) | AraDiCE PIQA (MSA) (0-shot) | AraDiCE PIQA (Egy) (0-shot) | AraDiCE PIQA (Lev) (0-shot) | AraDiCE ArabicMMLU (Egy) (0-shot) | AraDiCE ArabicMMLU (Lev) (0-shot) |
|-------|----------------|--------------------------|----------------------|--------------------|---------------|-------------------------|------------------------------|---------------|--------|----------------|------------------|---------------------------|-----------------------------|-------------------------------|------------------------------|------------------------------|-----------------------------------|-----------------------------------|
| Fanar-1-9B-it | 71.53% | **58.89%** | 67.69% | **83.16%** | **82.54%** | **67.15%** | **83.22%** | 80.02% | **74.60%** | **68.32%** | **66.29%** | **78.68%** | 72.40% | **67.68%** | **63.66%** | 59.03% | 59.63% | 60.62% |
| ALLaM-7B-Instruct-preview | 60.72% | 54.89% | **68.59%** | 76.35% | 80.52% | 51.62% | 75.80% | 74.52% | 46.63% | 57.31% | 63.66% | 76.31% | **74.20%** | 67.52% | 63.44% | **60.88%** | **62.50%** | **64.17%** |
| aya-expanse-8b | 62.85% | 47.14% | 60.10% | 78.54% | 81.18% | 56.40% | 70.78% | 77.11% | 8.26% | 53.18% | 59.74% | 70.20% | 67.30% | 63.00% | 59.41% | 56.53% | 53.52% | 53.71% |
| c4ai-command-r7b-arabic-02-2025 | 66.91% | 49.54% | 63.06% | 74.67% | 78.02% | 49.15% | 72.78% | 79.80% | 30.33% | 49.38% | 64.44% | 73.82% | 69.20% | 62.30% | 60.99% | 56.69% | 54.78% | 56.06% |
| AceGPT-v2-8B-Chat | 66.45% | 51.16% | 62.61% | 79.21% | 80.58% | 53.50% | 74.56% | 77.66% | 41.77% | 50.16% | 60.40% | 74.31% | 68.90% | 64.58% | 61.32% | 56.91% | 54.53% | 53.91% |
| gemma-2-9b-it | 71.65% | 57.93% | 64.16% | 79.06% | 79.38% | 63.99% | 78.31% | **80.67%** | 60.95% | 56.11% | 64.21% | 73.69% | 68.60% | 61.26% | 59.96% | 57.24% | 57.95% | 59.25% |
| jais-adapted-13b-chat | 56.64% | 44.45% | 58.97% | 80.86% | 80.47% | 54.27% | 67.52% | 75.24% | 44.05% | 46.41% | 56.56% | 65.46% | 65.30% | 61.10% | 58.05% | 55.77% | 52.87% | 53.59% |
| jais-family-6p7b-chat | 49.42% | 41.59% | 55.80% | 72.04% | 74.05% | 44.62% | 65.11% | 72.04% | 53.68% | 48.20% |    54.73%    | 61.72% | 64.10% | 62.51% | 60.12% | 57.24% | 49.11% | 47.49% |
| Llama-3.1-8B-Instruct | 68.04% | 47.58% | 59.05% | 79.22% | 80.74% | 55.29% | 66.72% | 76.67% | 29.26% | 47.81% | 55.97% | 69.70% | 66.10% | 58.11% | 55.39% | 54.24% | 46.86% | 47.52% |
| Qwen2.5-7B-Instruct | **74.21%** | 55.63% | 63.96% | 80.44% | 79.92% | 55.03% | 74.61% | 78.09% | 71.34% | 54.19% | 62.69% | 75.69% | 68.10% | 60.55% | 58.65% | 56.04% | 48.74% | 53.42% |

</div>
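
Most of the public benchmarks above can be reproduced approximately with the open-source harness itself; note that our internal version is modified, so scores may not match exactly. Below is a minimal sketch, assuming the harness is installed; the task choice, few-shot setting, and batch size are illustrative.

```python
import lm_eval

# Evaluate the instruct model on one public benchmark from the table above (HellaSwag, 0-shot).
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=QCRI/Fanar-1-9B-Instruct,dtype=bfloat16",
    tasks=["hellaswag"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```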


---

## Citation

If you use Fanar-1-9B-Instruct or the Fanar GenAI system in your research or applications, please cite:

```bibtex
@misc{fanarllm2025,
      title={Fanar: An Arabic-Centric Multimodal Generative AI Platform}, 
      author={Fanar Team and Ummar Abbas and Mohammad Shahmeer Ahmad and Firoj Alam and Enes Altinisik and Ehsannedin Asgari and Yazan Boshmaf and Sabri Boughorbel and Sanjay Chawla and Shammur Chowdhury and Fahim Dalvi and Kareem Darwish and Nadir Durrani and Mohamed Elfeky and Ahmed Elmagarmid and Mohamed Eltabakh and Masoomali Fatehkia and Anastasios Fragkopoulos and Maram Hasanain and Majd Hawasly and Mus'ab Husaini and Soon-Gyo Jung and Ji Kim Lucas and Walid Magdy and Safa Messaoud and Abubakr Mohamed and Tasnim Mohiuddin and Basel Mousi and Hamdy Mubarak and Ahmad Musleh and Zan Naeem and Mourad Ouzzani and Dorde Popovic and Amin Sadeghi and Husrev Taha Sencar and Mohammed Shinoy and Omar Sinan and Yifan Zhang and Ahmed Ali and Yassine El Kheir and Xiaosong Ma and Chaoyi Ruan},
      year={2025},
      url={https://arxiv.org/abs/2501.13944}, 
}
```

---

## Acknowledgements

This project was developed at [Qatar Computing Research Institute (QCRI)](https://qcri.org), part of [Hamad Bin Khalifa University (HBKU)](https://hbku.edu.qa), a member of Qatar Foundation. We thank our engineers, researchers, and support team for their efforts in advancing Arabic-centric large language models.
Special thanks to the [Ministry of Communications and Information Technology, State of Qatar](https://www.mcit.gov.qa/en/) for their continued support and for providing the compute infrastructure through the Google Cloud Platform.


---

## License

This model is licensed under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).