---
license: apache-2.0
language:
- bg
- cs
- nl
- en
- fi
- fr
- de
- el
- it
- pl
- pt
- es
- sv
- code
tags:
- multilingual
- base-model
- transformer
- decoder-only
- LLM
- smol
- MiniLingua
---

# MiniLingua-1b

**MiniLingua-1b** is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-token SentencePiece tokenizer supporting the following languages: Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, and Swedish, as well as programming code.

### Training Details

MiniLingua-1b was trained on a 1 trillion token corpus that includes:

- [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
- Curated high-quality multilingual and code data from public sources

The model was trained for 1.5 epochs over 12 days on the [LUMI supercomputer](https://lumi-supercomputer.eu/), using:

- 256 AMD MI250X GPUs
- bf16 precision
- the Megatron-LM library
- data parallelism

### Intended Use

This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language-understanding tasks in low- and high-resource European languages.

### License

Apache 2.0: free for research and commercial use, subject to the license terms.

---
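### Usage

A minimal loading sketch with 🤗 Transformers. The repo id `MiniLingua/MiniLingua-1b` is an assumption, so substitute the actual repository path; bf16 is chosen to match the training precision. Since this is a base model (not instruction-tuned), prompts should be plain text for continuation rather than chat-formatted messages.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with the actual Hugging Face path.
model_id = "MiniLingua/MiniLingua-1b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bf16 training precision
)

# Base model: give it plain text to continue, in any supported language.
inputs = tokenizer("Berlin ist die Hauptstadt von", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```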