---
library_name: transformers
tags:
- qwen3
- qwen3moe
- mixture-of-experts
- llm
- text-generation
- instruction-following
- agentic-ai
- tool-use
- low-resource
- edge-ai
- from-scratch
- causal-lm
license: apache-2.0
datasets:
- kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1

---
# ๐Ÿง  LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)

[![Model Size](https://img.shields.io/badge/Parameters-133M-blue)]()
[![Experts](https://img.shields.io/badge/Experts-8-lightgrey)]()
[![Routing](https://img.shields.io/badge/Active_Experts-2-orange)]()
[![Active Params](https://img.shields.io/badge/ActiveParameters-80M-red)]()
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

---

## ๐Ÿ“ Model Card

**LoggenixMoE133M** is a small Mixture-of-Experts (MoE) Causal Language Model trained **from scratch** on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.

- **Architecture**: A lightweight transformer with Mixture-of-Experts routing, **inspired by the Qwen3 MoE architecture.**
- **Parameter Count**: 133M total, with 2 experts active per token (approx. 80M active per step).
- **Experts**: 8 total, gated per token with router logits.
- **Activation Strategy**: Top-2 routing with auxiliary routing loss.
- **Tokenizer Features**: The tokenizer includes dedicated special tokens for agentic capabilities: `<tool_call>` and `<think>`. These tokens support reasoning, planning, and interaction with external tools, so the model can serve as a foundation for building AI agents (see the sketch below).
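
As a quick sanity check, the snippet below (a minimal sketch) confirms that these agentic special tokens are registered in the tokenizer vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
)

# Each agentic special token should map to its own id rather than the unknown-token id.
for tok in ("<tool_call>", "<think>"):
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))
```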

---

## ๐Ÿ“Š Training Details

| Attribute               | Value                                          |
|------------------------|------------------------------------------------|
| Total Params           | 133M                                           |
| MoE Config             | 8 experts, top-2 gating                        |
| Dataset Type           | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs        | 5                                              |
| Effective Tokens Seen  | 1.5 Billion                                    | 
| Train Loss (final)     | 3.263                                          |
| Val Loss (final)       | 3.327                                          |
| Mean Token Accuracy    | ~48%                                           | 
| Optimizer              | AdamW                                          |
| Scheduler              | Linear Warmup + Cosine Decay                   |
| Precision              | FP16 with GradScaler                           |
| Checkpoint Format      | HF-compatible                                  |
| Training Cost          | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length         | 1024                                           |
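
The optimizer, scheduler, and precision rows above correspond roughly to the training loop sketched below. This is an illustrative reconstruction, not the authors' script: the warmup and total step counts and the `dataloader` are assumptions, and the learning rate is inferred from the `lr5e4` suffix in the repo id.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
).cuda()

optimizer = AdamW(model.parameters(), lr=5e-4)   # lr inferred from the "lr5e4" suffix
scheduler = get_cosine_schedule_with_warmup(     # linear warmup + cosine decay
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
scaler = torch.cuda.amp.GradScaler()             # FP16 with GradScaler

# `dataloader` is assumed to yield dicts of input_ids, attention_mask and labels on GPU.
for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        # output_router_logits=True adds the auxiliary load-balancing loss to the LM loss.
        loss = model(**batch, output_router_logits=True).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```
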
---

## ๐Ÿงช Intended Use

### โœ… Suitable for:
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- **Fine-tuning for specific tasks on edge devices** (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection); see the fine-tuning sketch after this list
- **Building specialized AI agents** that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
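
For the fine-tuning use case, the sketch below shows a minimal setup with the Hugging Face `Trainer`. The dataset name (`your-org/your-task-dataset`) and all hyperparameters are placeholders, not values used for this model:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder dataset: any corpus with a "text" column works here.
ds = load_dataset("your-org/your-task-dataset", split="train")
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=ds.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetune",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```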

### ๐Ÿšซ Not suitable for:
- Long-context tasks (>4K tokens)
- High-stakes factual QA
- Safety-critical decision-making without oversight

---

## ๐Ÿงจ Example Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]

# Build the prompt with the chat template and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Alternatively: shorter, more controlled sampling
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,           # reduced for quick testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,             # disable KV caching to avoid potential issues
    )
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```
---

## 🔧 Expert Routing

This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.

During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.

Note: Router logits can be returned in the model outputs by passing `output_router_logits=True`.
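
For example (a minimal sketch, assuming the checkpoint exposes router logits the way MoE models in `transformers` typically do):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Summarize this incident log:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# One tensor per MoE layer, shaped (num_tokens, num_experts).
print(len(out.router_logits), out.router_logits[0].shape)
```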

---
๐Ÿ“ƒ License
---

This model is released under the Apache 2.0 License.

---
๐Ÿ™Œ Acknowledgements
---
Trained using:
---

๐Ÿงจ Hugging Face Transformers

๐Ÿง  Custom training loop with gradient checkpointing

๐Ÿงฎ NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)

๐Ÿ“ฆ Logged and tracked via Weights & Biases

---

### ๐Ÿ—ฃ๏ธ Citation
---
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year = {2025},
  url = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note = {Trained from scratch on RCA + code + reasoning dataset.}
}
---