---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- vllm
- stem
- merge
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- omniomni/omni-0-science-preview
- omniomni/omni-0-technology-preview
- omniomni/omni-0-engineering-preview
- omniomni/omni-0-math-preview
---



# Omni 0 Preview Models

Omni 0 preview models are a series of 1.7B-parameter LLMs optimized for STEM knowledge, consisting of four expert models and one final model merged from all experts. The experts are the [Science Expert](https://huggingface.co/omniomni/omni-0-science-preview), [Technology Expert](https://huggingface.co/omniomni/omni-0-technology-preview), [Engineering Expert](https://huggingface.co/omniomni/omni-0-engineering-preview), and [Math Expert](https://huggingface.co/omniomni/omni-0-math-preview). Through expert finetuning and DARE-TIES merging, Omni achieves state-of-the-art domain benchmark results among alternative optimization techniques, as well as optimal compute-knowledge efficiency among similar models, while performing comparably with those models on the tested benchmarks.

*Figure: Omni achieves optimal compute-knowledge efficiency compared to alternative models.*
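
The merged model was produced with DARE-TIES. As a rough illustration of what such a merge looks like, here is a minimal sketch using [mergekit](https://github.com/arcee-ai/mergekit)'s documented Python entry point; the `density` and `weight` values and the output path are assumptions for illustration, not the settings actually used to produce `omni-0-mini-preview`.

```python
# Minimal sketch of a DARE-TIES merge of the four Omni experts with mergekit.
# The density/weight values below are illustrative assumptions, not the
# configuration used to build omni-0-mini-preview.
import torch
from mergekit.config import MergeConfiguration
from mergekit.merge import MergeOptions, run_merge

config = {
    "merge_method": "dare_ties",
    "base_model": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    "models": [
        {
            "model": f"omniomni/omni-0-{domain}-preview",
            "parameters": {"density": 0.5, "weight": 0.25},  # assumed values
        }
        for domain in ("science", "technology", "engineering", "math")
    ],
    "dtype": "bfloat16",
}

run_merge(
    MergeConfiguration.model_validate(config),
    "./omni-0-mini-merged",  # assumed output path
    options=MergeOptions(cuda=torch.cuda.is_available(), copy_tokenizer=True),
)
```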

> [!Note]
> This model card is for [omni-0-mini-preview](https://huggingface.co/omniomni/omni-0-mini-preview). All other models can be found on [Omni's HuggingFace page](https://huggingface.co/omniomni).

# Benchmarks

**Omni-0-mini-preview Benchmarks**

| **Benchmark** | **Omni** | **Base** | **Llama 3.2 3B** | **Gemma 3 4B** | **Llama 3.1 8B** |
|---|---|---|---|---|---|
| MMLU STEM (4-shot CoT) | **35.02** | 26.59 | 33.28 | 40.82 | 52.22 |
| MMLU Science (4-shot CoT) | **34.44** | 28.03 | 33.47 | 42.93 | 52.54 |
| MMLU Technology (4-shot CoT) | **41.07** | 30.86 | 45.28 | 46.74 | 63.72 |
| MMLU Engineering (4-shot CoT) | **37.50** | 25.93 | 34.65 | 43.66 | 55.58 |
| MMLU Math (4-shot CoT) | **35.54** | 23.86 | 39.51 | 35.31 | 45.84 |
| HumanEval (pass@1) | **31.71** | 29.88 | 51.83 | 57.32 | 57.93 |
| SciQ (0-shot) | **87.30** | 76.10 | 93.30 | 87.50 | 91.80 |
| MATH (4-shot) | 15.66 | **16.12** | 28.44 | 26.38 | 29.56 |
| ARC-Challenge (0-shot) | **43.00** | 40.10 | 46.16 | 44.11 | 54.18 |
| ARC-Easy (0-shot) | **66.67** | 58.54 | 67.93 | 63.01 | 75.80 |
| **Average** | **37.91** | 30.25 | 38.33 | 43.91 | 54.22 |
| **Improvement** | **25.32%** | Base | -- | -- | -- |
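
For readers who want to spot-check numbers like these, a minimal sketch using EleutherAI's lm-evaluation-harness Python API follows; the task names, few-shot setting, and dtype here are assumptions and may not match the exact evaluation configuration behind the table above.

```python
# Minimal sketch: spot-checking omni-0-mini-preview on the 0-shot tasks from
# the table with lm-evaluation-harness (pip install lm-eval). Task names and
# settings are assumptions, not the exact evaluation setup used here.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=omniomni/omni-0-mini-preview,dtype=bfloat16",
    tasks=["arc_easy", "arc_challenge", "sciq"],
    num_fewshot=0,  # the table reports these three tasks 0-shot
)

# Print the per-task metric dictionaries
for task, metrics in results["results"].items():
    print(task, metrics)
```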
**Expert Model Benchmarks**

| Benchmark | Science | Technology | Engineering | Math |
|---|---|---|---|---|
| MMLU Science (4-shot CoT) | 26.69 | -- | -- | -- |
| SciQ (0-shot) | 85.80 | -- | -- | -- |
| ARC-Challenge (0-shot) | 42.41 | -- | -- | -- |
| ARC-Easy (0-shot) | 66.96 | -- | -- | -- |
| MMLU Technology (4-shot CoT) | -- | 35.30 | -- | -- |
| HumanEval (pass@1) | -- | 32.93 | -- | -- |
| MMLU Engineering (4-shot CoT) | -- | -- | 32.07 | -- |
| MMLU Math (4-shot CoT) | -- | -- | -- | 30.83 |
| MATH (4-shot) | -- | -- | -- | 18.76 |
| Expert Average | **36.28** | **34.83** | **32.07** | **28.82** |
| Base Average | 35.59 | 29.79 | 30.86 | 22.57 |
| Improvement | 1.94% | 16.92% | 3.92% | 27.69% |

> [!Note]
> Expert Average refers to each expert model's average score over the benchmarks for its own STEM domain; Base Average is the base model's average over the same benchmarks.

# Models

Omni comes in a total of 5 models:

**Merged Model**

- `omni-0-mini-preview` - Merged output of all four experts through DARE-TIES, delivering large improvements in performance across STEM domains compared to its base.

**Experts**

- `omni-0-science-preview` - Science expert finetuned on corpora of scientific Wikipedia texts and academic papers, as well as a chat-templated scientific Q&A dataset
- `omni-0-technology-preview` - Technology expert finetuned on chat-templated code-generation data and Stack Exchange questions with their top-voted answers
- `omni-0-engineering-preview` - Engineering expert finetuned on corpora of engineering-related Wikipedia texts and academic papers
- `omni-0-math-preview` - Math expert finetuned on chat-templated math Q&A data

All Omni experts are finetuned from their base model, [SmolLM2 1.7B Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct), on H100/Ada 6000/A100 GPUs; the merged model improves on the base by 25.32% on average across all tested STEM benchmarks.

# Features

**Made for all**

Omni is a series of highly efficient Large Language Models aimed at expanding the accessibility of AI and filling in its gaps among underserved populations.

**Efficient**

Omni operates at optimal compute-knowledge efficiency compared to similar models.

**Merged architecture**

Omni uses merging to provide the collective accuracy of specialized models, leveraging their capabilities to enhance the final merged model.

**Multi-disciplinary**

Omni's first variant achieves state-of-the-art performance across STEM compared to alternative optimization techniques.

---

# Inference

## Transformers

Transformers is a framework by Hugging Face unifying model development and inference, allowing for simple and seamless interaction with models hosted on the Hugging Face Hub.
To get started with running inference using Omni, install transformers:

```bash
pip install transformers
```

After transformers has been installed, run the following code to generate outputs from any Omni model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "omniomni/omni-0-mini-preview"  # Can be any Omni model
device = "cuda"  # For GPU usage or "cpu" for CPU usage

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# For multiple GPUs, install accelerate and use `model = AutoModelForCausalLM.from_pretrained(checkpoint, device_map="auto")`
# However, Omni is small enough to run on individual commodity GPUs and low-resource devices
model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)

messages = [{"role": "user", "content": "What is STEM?"}]
# add_generation_prompt=True appends the assistant header so the model starts a reply
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_new_tokens=256, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0]))
```

## vLLM

vLLM provides fast LLM inference for a vast range of model architectures, through both a server and an in-process (offline) API.

First, install vLLM's package:

```bash
uv pip install vllm --torch-backend=auto
```

After that, run this command to start an OpenAI-compatible server with Omni via vLLM:

```bash
vllm serve omniomni/omni-0-mini-preview
```

To use Omni with vLLM without creating a server, run this code to generate outputs within a Python file:

```python
# vLLM automatically uses a GPU unless built with CPU wheels, so there is no need to specify a device
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The answer to x^2 + 2x + 1 is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.7)

llm = LLM(model="omniomni/omni-0-mini-preview")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
```
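
Once the server is running, it exposes an OpenAI-compatible API (by default at `http://localhost:8000/v1`), so any OpenAI-compatible client can query Omni. Below is a minimal sketch using the official `openai` Python package; the prompt and sampling values are arbitrary examples.

```python
# Minimal sketch: querying the vLLM server started above through its
# OpenAI-compatible endpoint (default http://localhost:8000/v1).
# The prompt and sampling settings are arbitrary example values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real key unless one is configured
)

response = client.chat.completions.create(
    model="omniomni/omni-0-mini-preview",
    messages=[{"role": "user", "content": "What is STEM?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)
```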