---
library_name: transformers
tags:
- qwen3
- qwen3moe
- mixture-of-experts
- llm
- text-generation
- instruction-following
- agentic-ai
- tool-use
- low-resource
- edge-ai
- from-scratch
- causal-lm
license: apache-2.0
datasets:
- kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1
---
# 🧠 LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)
---
## 📋 Model Card
**LoggenixMoE133M** is a small Mixture-of-Experts (MoE) Causal Language Model trained **from scratch** on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.
- **Architecture**: A lightweight transformer with Mixture-of-Experts routing, following the Qwen3-MoE architectural design.
- **Parameter Count**: 133M total, with 2 experts active per token (approx. 80M active per step).
- **Experts**: 8 total, gated per token with router logits.
- **Activation Strategy**: Top-2 routing with auxiliary routing loss.
- **Tokenizer Features**: The tokenizer includes dedicated special tokens for agentic use, `<tool_call>` and `<think>`, which support reasoning, planning, and interaction with external tools so the model can serve as a foundation for building AI agents (see the sketch below).
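
A minimal sketch (using the repo id from the Example Usage section below) to confirm that the agentic special tokens are registered in the tokenizer vocabulary:

```python
from transformers import AutoTokenizer

# Repo id taken from the Example Usage section; swap in your own checkpoint if needed.
tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
)

# If a token is registered, convert_tokens_to_ids returns its dedicated id;
# otherwise it falls back to the unknown-token id.
for token in ["<tool_call>", "<think>"]:
    print(f"{token!r} -> id {tokenizer.convert_tokens_to_ids(token)}")
```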
---
## 📊 Training Details
| Attribute | Value |
|------------------------|------------------------------------------------|
| Total Params | 133M |
| MoE Config | 8 experts, top-2 gating |
| Dataset Type | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs | 5 |
| Effective Tokens Seen | 1.5 Billion |
| Train Loss (final) | 3.263 |
| Val Loss (final) | 3.327 |
| Mean Token Accuracy | ~48% |
| Optimizer | AdamW |
| Scheduler | Linear Warmup + Cosine Decay |
| Precision | FP16 with GradScaler |
| Checkpoint Format | HF-compatible |
| Training Cost | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length | 1024 |
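
The exact training script is not published here; the sketch below only illustrates the FP16 + GradScaler setup listed in the table, with `model`, `optimizer`, `scheduler`, and `batch` assumed to be defined elsewhere:

```python
import torch
from torch.cuda.amp import GradScaler, autocast

# Illustrative FP16 training step with GradScaler; not the exact LoggenixMoE133M loop.
scaler = GradScaler()

def train_step(model, optimizer, scheduler, batch):
    # batch is assumed to contain input_ids, attention_mask, and labels.
    optimizer.zero_grad(set_to_none=True)
    with autocast(dtype=torch.float16):
        # For MoE models, the auxiliary routing loss is typically already
        # folded into outputs.loss by the modeling code.
        outputs = model(**batch)
        loss = outputs.loss
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)          # unscale gradients, then take the optimizer step
    scaler.update()                 # adjust the scale factor for the next step
    scheduler.step()                # linear warmup + cosine decay schedule
    return loss.item()
```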
---
## 🧪 Intended Use
### ✅ Suitable for:
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- **Fine-tuning for specific tasks on edge devices** (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection); a minimal fine-tuning sketch follows this list
- **Building specialized AI agents** that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
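
As a starting point for the fine-tuning use case above, here is a minimal sketch using the standard `transformers` `Trainer`; the dataset name and its `"text"` column are placeholders, and the hyperparameters are illustrative, not recommendations:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

repo = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(repo)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(repo)

# Placeholder dataset: replace "your_dataset" and the "text" column with your own data.
dataset = load_dataset("your_dataset", split="train")

def tokenize(batch):
    # Keep sequences within the model's 1024-token context window.
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-moe-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        fp16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```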
### 🚫 Not suitable for:
- Long-context tasks (>4K tokens)
- High-stakes factual QA
- Safety-critical decision-making without oversight
---
## 🧨 Example Usage
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# Model
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
memory = model.get_memory_footprint() / 1e6
print(f"Memory footprint: {memory:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]

# Apply the chat template and move the input ids to the model's device
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Alternatively, with explicit sampling parameters:
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,            # reduced for quick testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,              # disable KV caching to avoid potential issues
    )
generated_text = tokenizer.decode(outputs.sequences[0], skip_special_tokens=True)
print(generated_text)
```
---
## 🧠 Expert Routing
This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.
During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.
**Note**: Routing logits are optionally available in the model outputs via `output_router_logits=True`.
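
A short sketch of how to inspect the routing decisions via `output_router_logits=True`; the exact output shape depends on the underlying Qwen3-MoE-style modeling code, so treat the shape comment as an assumption:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(repo)

inputs = tokenizer("Summarize the root cause of the outage.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_router_logits=True)

# One tensor of router logits per MoE layer, shaped roughly
# (batch_size * sequence_length, num_experts); the two experts with the
# highest logits are the ones selected by top-2 gating.
for i, layer_logits in enumerate(outputs.router_logits):
    top2 = layer_logits.topk(2, dim=-1).indices
    print(f"layer {i}: expert ids for first token -> {top2[0].tolist()}")
```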
---
## 📜 License
This model is released under the Apache 2.0 License.
---
## 🙏 Acknowledgements
Trained using:

- 🧨 Hugging Face Transformers
- 🔧 Custom training loop with gradient checkpointing
- 🧮 NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)
- 📦 Logged and tracked via Weights & Biases
---
## 🗣️ Citation

```bibtex
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title  = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year   = {2025},
  url    = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note   = {Trained from scratch on RCA + code + reasoning dataset.}
}
```
---