kshitijthakkar committed · verified
Commit d6228bf · 1 Parent(s): a429c3a

Update README.md

Files changed (1): README.md (+6 -1)

README.md CHANGED:
@@ -107,6 +107,7 @@ with torch.no_grad():
 ---
 🔧 Expert Routing
 ---
+
 This model uses a top-2 gating mechanism where, for each token, two of the eight experts are selected based on learned router logits.
 
 During training, a light auxiliary loss was applied to encourage balanced expert usage and improve routing stability.
@@ -116,12 +117,15 @@ Note: Routing logits are optionally available in the model outputs via output_router_logits
 ---
 📃 License
 ---
+
 This model is released under the Apache 2.0 License.
+
 ---
 🙌 Acknowledgements
 ---
 Trained using:
 ---
+
 🧨 Hugging Face Transformers
 
 🧠 Custom training loop with gradient checkpointing
@@ -129,9 +133,10 @@ Trained using:
 🧮 NVIDIA RTX 4090 (24GB VRAM) / A100 (40GB)
 
 📦 Logged and tracked via Weights & Biases
+
 ---
-### 🗣️ Citation
 
+### 🗣️ Citation
 ---
 @misc{loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060,
   title = {loggenix-moe-0.12B-A0.08B-e5-lr5e4-b4-3060: A Lightweight Mixture-of-Experts Model},
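
For readers skimming this commit, the Expert Routing section touched above describes top-2 gating over eight experts plus a light auxiliary loss for balanced expert usage. The snippet below is a minimal, hypothetical PyTorch sketch of that mechanism, not the model's actual routing code: the function names, the Switch/Mixtral-style balancing formula, and the toy dimensions are assumptions made for illustration.

```python
# Minimal sketch of top-2 expert routing with a load-balancing auxiliary loss.
# The 8-expert / top-2 setup follows the README's description; the exact layer
# shapes, loss formula, and loss weight in the released model may differ.
import torch
import torch.nn.functional as F


def top2_route(hidden_states: torch.Tensor, router_weight: torch.Tensor):
    """hidden_states: (tokens, d_model); router_weight: (num_experts, d_model)."""
    router_logits = hidden_states @ router_weight.t()      # (tokens, num_experts)
    probs = F.softmax(router_logits, dim=-1)
    top2_probs, top2_idx = probs.topk(2, dim=-1)            # two experts per token
    top2_probs = top2_probs / top2_probs.sum(dim=-1, keepdim=True)  # renormalize gates
    return router_logits, top2_probs, top2_idx


def load_balancing_loss(router_logits: torch.Tensor, top2_idx: torch.Tensor,
                        num_experts: int = 8) -> torch.Tensor:
    """Switch/Mixtral-style auxiliary term that penalizes uneven expert usage."""
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                            # avg router prob per expert
    one_hot = F.one_hot(top2_idx, num_experts).float().sum(dim=1)  # (tokens, num_experts)
    mean_load = one_hot.mean(dim=0) / 2.0                    # each token makes 2 assignments
    return num_experts * torch.sum(mean_prob * mean_load)


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, num_experts, tokens = 64, 8, 16
    x = torch.randn(tokens, d_model)
    w_router = torch.randn(num_experts, d_model) * 0.02
    logits, gates, experts = top2_route(x, w_router)
    aux = load_balancing_loss(logits, experts, num_experts)
    print(experts.shape, gates.shape, aux.item())            # (16, 2), (16, 2), scalar
```

As the second hunk's context line notes, the released model can surface its real router logits by passing output_router_logits=True, so a sketch like this is only needed to interpret what those logits represent.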