---
library_name: transformers
tags:
- qwen3
- qwen3moe
- mixture-of-experts
- llm
- text-generation
- instruction-following
- agentic-ai
- tool-use
- low-resource
- edge-ai
- from-scratch
- causal-lm
license: apache-2.0
datasets:
- kshitijthakkar/loggenix-mc-oraca-agentinstruct-1m-v1

---
# ๐Ÿง  LoggenixMoE133M: A Lightweight Mixture-of-Experts Language Model (8E2A)

[![Model Size](https://img.shields.io/badge/Parameters-133M-blue)]()
[![Experts](https://img.shields.io/badge/Experts-8-lightgrey)]()
[![Routing](https://img.shields.io/badge/Active_Experts-2-orange)]()
[![Active Params](https://img.shields.io/badge/ActiveParameters-80M-red)]()
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://www.apache.org/licenses/LICENSE-2.0)

---

## ๐Ÿ“ Model Card

**LoggenixMoE133M** is a small Mixture-of-Experts (MoE) Causal Language Model trained **from scratch** on a custom dataset containing root cause analysis (RCA), code generation, and reasoning tasks.

- **Architecture**: A lightweight transformer with Mixture-of-Experts routing, **inspired by the Qwen3 MoE architecture.**
- **Parameter Count**: 133M total, with 2 experts active per token (approx. 80M active per step).
- **Experts**: 8 total, gated per token with router logits.
- **Activation Strategy**: Top-2 routing with auxiliary routing loss.
- **Tokenizer Features**: The tokenizer includes dedicated special tokens for agentic capabilities: `<tool_call>` and `<think>`. These tokens support reasoning, planning, and interaction with external tools, so the model can serve as a foundation for building AI agents (see the sketch below).
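
As a quick sanity check, the snippet below (a minimal sketch) confirms that these agentic special tokens are registered in the tokenizer vocabulary:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
)

# Each agentic special token should map to its own id rather than the unknown-token id.
for tok in ("<tool_call>", "<think>"):
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))
```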

---

## ๐Ÿ“Š Training Details

| Attribute               | Value                                          |
|------------------------|------------------------------------------------|
| Total Params           | 133M                                           |
| MoE Config             | 8 experts, top-2 gating                        |
| Dataset Type           | RCA, code, and logic prompts (15+ task splits) |
| Training Epochs        | 5                                              |
| Effective Tokens Seen  | 1.5 Billion                                    | 
| Train Loss (final)     | 3.263                                          |
| Val Loss (final)       | 3.327                                          |
| Mean Token Accuracy    | ~48%                                           | 
| Optimizer              | AdamW                                          |
| Scheduler              | Linear Warmup + Cosine Decay                   |
| Precision              | FP16 with GradScaler                           |
| Checkpoint Format      | HF-compatible                                  |
| Training Cost          | $94 across Modal (A100 40GB) + Hyperbolic (RTX 4090) |
| Context Length         | 1024                                           |
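
The optimizer, scheduler, and precision rows above correspond roughly to the training loop sketched below. This is an illustrative reconstruction, not the authors' script: the warmup and total step counts and the `dataloader` are assumptions, and the learning rate is inferred from the `lr5e4` suffix in the repo id.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, get_cosine_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
).cuda()

optimizer = AdamW(model.parameters(), lr=5e-4)   # lr inferred from the "lr5e4" suffix
scheduler = get_cosine_schedule_with_warmup(     # linear warmup + cosine decay
    optimizer, num_warmup_steps=1_000, num_training_steps=100_000
)
scaler = torch.cuda.amp.GradScaler()             # FP16 with GradScaler

# `dataloader` is assumed to yield dicts of input_ids, attention_mask and labels on GPU.
for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(dtype=torch.float16):
        # output_router_logits=True adds the auxiliary load-balancing loss to the LM loss.
        loss = model(**batch, output_router_logits=True).loss
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```
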
---

## ๐Ÿงช Intended Use

### โœ… Suitable for:
- Instruction-following tasks
- Root cause analysis (RCA) and structured summarization
- Lightweight code generation (Python)
- Chain-of-thought style reasoning prompts
- **Fine-tuning for specific tasks on edge devices** (e.g., smart home voice assistants, mobile offline chatbots, industrial IoT anomaly detection); see the fine-tuning sketch after this list
- **Building specialized AI agents** that can reason, plan, and interact with external tools (e.g., automated customer support, workflow automation, personalized learning agents)
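
For the fine-tuning use case, the sketch below shows a minimal setup with the Hugging Face `Trainer`. The dataset name (`your-org/your-task-dataset`) and all hyperparameters are placeholders, not values used for this model:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Placeholder dataset: any corpus with a "text" column works here.
ds = load_dataset("your-org/your-task-dataset", split="train")
ds = ds.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=1024),
    remove_columns=ds.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="loggenix-finetune",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        fp16=True,
        logging_steps=50,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```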

### ๐Ÿšซ Not suitable for:
- Long-context tasks (>4K tokens)
- High-stakes factual QA
- Safety-critical decision-making without oversight

---

## ๐Ÿงจ Example Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:,.1f} MB")

messages = [
    {"role": "system", "content": ""},
    {"role": "user", "content": "Write a Python function to compute factorial."},
]

# Build the prompt with the chat template and move it to the model's device.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, do_sample=True, use_cache=False, max_new_tokens=512)
print(tokenizer.decode(outputs[0]))

# Alternatively: shorter, more controlled sampling
with torch.no_grad():
    outputs = model.generate(
        inputs,
        max_new_tokens=50,           # reduced for quick testing
        do_sample=True,
        temperature=0.5,
        top_p=0.95,
        return_dict_in_generate=True,
        use_cache=False,             # disable KV caching to avoid potential issues
    )
print(tokenizer.decode(outputs.sequences[0], skip_special_tokens=True))
```
---

## 🔧 Expert Routing

This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.

During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.

Note: Router logits can be returned in the model outputs by passing `output_router_logits=True`.
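
For example (a minimal sketch, assuming the checkpoint exposes router logits the way MoE models in `transformers` typically do):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Summarize this incident log:", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# One tensor per MoE layer, shaped (num_tokens, num_experts).
print(len(out.router_logits), out.router_logits[0].shape)
```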

---
๐Ÿ“ƒ License
---

This model is released under the Apache 2.0 License.

---
๐Ÿ™Œ Acknowledgements
---
Trained using:
---

๐Ÿงจ Hugging Face Transformers

๐Ÿง  Custom training loop with gradient checkpointing

๐Ÿงฎ NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)

๐Ÿ“ฆ Logged and tracked via Weights & Biases

---

### ๐Ÿ—ฃ๏ธ Citation
---
@misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
  title = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
  author = {kshitijthakkar},
  year = {2025},
  url = {https://huggingface.co/kshitijthakkar/loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060},
  note = {Trained from scratch on RCA + code + reasoning dataset.}
}
---