---
license: apache-2.0
language:
- bg
- cs
- nl
- en
- fi
- fr
- de
- el
- it
- pl
- pt
- es
- sv
- code
tags:
- multilingual
- base-model
- transformer
- decoder-only
- LLM
- smol
- MiniLingua
---

# MiniLingua-1b

**MiniLingua-1b** is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-token SentencePiece tokenizer supporting the following languages: Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, and Swedish, as well as programming code.

### Training Details

MiniLingua-1b was trained on a 1 trillion token corpus that includes:

- [FineWeb-2](https://huggingface.co/datasets/HuggingFaceFW/fineweb-2)
- [The Stack](https://huggingface.co/datasets/bigcode/the-stack)
- Curated high-quality multilingual and code data from public sources

The model was trained for 1.5 epochs over 12 days on the [LUMI supercomputer](https://lumi-supercomputer.eu/), using:

- 256 AMD MI250X GPUs
- bf16 precision
- the Megatron-LM library
- data parallelism

### Intended Use

This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language-understanding tasks in low- and high-resource European languages.

### License

Apache 2.0: free for research and commercial use, subject to the license terms.

---
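### Usage

A minimal loading sketch with 🤗 Transformers. The repo id `MiniLingua/MiniLingua-1b` is an assumption, so substitute the actual repository path; bf16 is chosen to match the training precision. Since this is a base model (not instruction-tuned), prompts should be plain text for continuation rather than chat-formatted messages.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id -- replace with the actual Hugging Face path.
model_id = "MiniLingua/MiniLingua-1b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bf16 training precision
)

# Base model: give it plain text to continue, in any supported language.
inputs = tokenizer("Berlin ist die Hauptstadt von", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```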