MiniLingua-1b

MiniLingua-1b is a multilingual base language model with approximately 1 billion parameters, trained from scratch with a custom 128k-vocabulary SentencePiece tokenizer covering the following languages:

Bulgarian, Czech, Dutch, English, Finnish, French, German, Greek, Italian, Polish, Portuguese, Spanish, Swedish, and programming code.

Training Details

MiniLingua-1b was trained on a corpus of roughly 1 trillion tokens.

The model was trained for 1.5 epochs over 12 days on the LUMI supercomputer, using:

  • 256 AMD MI250X GPUs
  • bf16 precision
  • Megatron-LM library
  • Data parallelism
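
A run with this setup might be launched roughly as sketched below. All node counts, batch sizes, flags beyond `--bf16`, and paths here are illustrative assumptions, not the actual MiniLingua-1b training command; consult the Megatron-LM documentation for the full argument set.

```shell
# Hedged sketch of a Megatron-LM data-parallel pretraining launch.
# 32 nodes x 8 GCDs matches 256 MI250X devices on LUMI-style nodes,
# but the real topology, batch sizes, and paths are assumptions.
torchrun --nnodes=32 --nproc_per_node=8 pretrain_gpt.py \
  --bf16 \
  --micro-batch-size 4 \
  --global-batch-size 1024 \
  --tokenizer-type SentencePieceTokenizer \
  --tokenizer-model /path/to/minilingua_128k.model \
  --data-path /path/to/preprocessed_corpus
```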

Intended Use

This model serves as a multilingual base LLM, suitable for instruction tuning, research, and language understanding tasks in low- and high-resource European languages.
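
As a base model, it can be loaded for generation or further tuning with the Hugging Face transformers library. A minimal sketch follows; the repository id `org/MiniLingua-1b` is a placeholder assumption, so substitute the model's actual hub path.

```python
# Minimal generation sketch using Hugging Face transformers.
# "org/MiniLingua-1b" is a placeholder repository id (an assumption);
# replace it with the model's actual hub path before running.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/MiniLingua-1b"  # placeholder, not the confirmed hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Suomen pääkaupunki on"  # Finnish: "The capital of Finland is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Since this is a base model rather than an instruction-tuned one, prompts should be phrased as text to continue, not as chat-style instructions.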

License

Apache 2.0: free for research and commercial use, subject to its terms.

