arnomatic committed (verified)
Commit 82756bb · 1 Parent(s): 3479481

Update README.md

Files changed (1)
  1. README.md +213 -233
README.md CHANGED
@@ -1,233 +1,213 @@
1
- ---
2
- language:
3
- - de
4
- license: mit
5
- library_name: transformers
6
- tags:
7
- - text-generation
8
- - pytorch
9
- - causal-lm
10
- - mixture-of-experts
11
- - moe
12
- - german
13
- - gpt
14
- - language-model
15
- base_model: []
16
- pipeline_tag: text-generation
17
- model-index:
18
- - name: german-moe-gpt-v8-pretrained
19
- results: []
20
- datasets:
21
- - wikipedia
22
- widget:
23
- - text: "Die Hauptstadt von Deutschland ist"
24
- example_title: "German Capital"
25
- - text: "Künstliche Intelligenz ist"
26
- example_title: "AI Definition"
27
- - text: "Es war einmal"
28
- example_title: "Story Beginning"
29
- ---
30
-
31
- # German MoE GPT v8 - OPUS EDITION
32
-
33
- A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
34
-
35
- > **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
36
-
37
- ## Model Description
38
-
39
- This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.
40
-
41
- ### Key Features
42
-
43
- - 🏗️ **Hybrid Dense + MoE Architecture:** Every 2nd layer uses MoE for efficiency
44
- - 🔬 **Research-Backed:** Implements ST-MoE and Switch Transformer best practices
45
- - ⚡ **Efficient:** Only ~33% of parameters active per token
46
- - 🖥️ **Cross-Platform:** Pure PyTorch, runs on Windows/Linux/macOS
47
- - 🤗 **HuggingFace Compatible:** Full integration with `transformers` library
48
-
49
- ## Model Specifications
50
-
51
- | Specification | Value |
52
- |--------------|-------|
53
- | Total Parameters | 149.6M |
54
- | Active Parameters per Token | ~49.9M (~33%) |
55
- | Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
56
- | Context Length | 2048 tokens |
57
- | Architecture | Hybrid Dense + MoE Transformer |
58
- | Layers | 12 |
59
- | Hidden Size | 768 |
60
- | Attention Heads | 12 |
61
- | Experts per MoE Layer | 32 |
62
- | Active Experts (Top-k) | 2 |
63
- | Position Embeddings | RoPE (Rotary Position Embeddings) |
64
-
65
- ## Training Data
66
-
67
- The model was trained on a 17.4 GB curated German corpus consisting of:
68
-
69
- - **Clean German Wikipedia** (~11 GB): Encyclopedic knowledge
70
- - **OpenSubtitles (German)**: Natural dialog and conversational language
71
- - **Belletristik**: German literature for style and creativity
72
-
73
- **Data Quality:** Deduplicated and SEO spam filtered for high-quality training signal.
74
-
75
- > **Adapting to other languages:** The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.
76
-
77
- ## Training Details
78
-
79
- ### Training Hyperparameters
80
-
81
- - **Steps:** 300,000
82
- - **Batch Size:** 32 (with gradient accumulation)
83
- - **Learning Rate:** 3e-4 (max)
84
- - **Hardware:** Single RTX 4090 (24GB VRAM)
85
- - **Training Time:** ~120 hours
86
- - **Precision:** Mixed (BF16)
87
-
88
- ### Results
89
-
90
- | Metric | Initial | Final | Improvement |
91
- |--------|---------|-------|-------------|
92
- | Training Loss | 12.0 | 2.55 | 79% ↓ |
93
- | Validation Loss | 4.58 | 2.40 | 48% ↓ |
94
- | Perplexity | - | 11.0 | - |
95
-
96
- ## Usage
97
-
98
- ### Installation
99
-
100
- ```bash
101
- pip install transformers torch
102
- ```
103
-
104
- ### Quick Start
105
-
106
- ```python
107
- from transformers import AutoTokenizer, AutoModelForCausalLM
108
-
109
- # Load model and tokenizer
110
- model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
111
- tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
112
-
113
- # Generate text
114
- prompt = "Die Hauptstadt von Deutschland ist"
115
- inputs = tokenizer(prompt, return_tensors="pt")
116
- outputs = model.generate(
117
- **inputs,
118
- max_new_tokens=100,
119
- temperature=0.8,
120
- top_k=50,
121
- top_p=0.9,
122
- do_sample=True
123
- )
124
-
125
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
126
- ```
127
-
128
- ### Advanced Usage
129
-
130
- ```python
131
- # Generate with custom parameters
132
- outputs = model.generate(
133
- **inputs,
134
- max_new_tokens=200,
135
- temperature=0.7, # Lower = more deterministic
136
- top_k=40, # Top-k sampling
137
- top_p=0.95, # Nucleus sampling
138
- repetition_penalty=1.1, # Reduce repetition
139
- do_sample=True
140
- )
141
- ```
142
-
143
- ## Technical Architecture
144
-
145
- ### MoE Layer Design
146
-
147
- The model uses a **Noisy Top-k Router** with the following components:
148
-
149
- 1. **Gate Computation:** Learned routing weights per expert
150
- 2. **Noise Injection:** Adds controlled noise during training for exploration
151
- 3. **Top-k Selection:** Routes each token to the 2 best experts
152
- 4. **Capacity Management:** Prevents expert overload with dynamic capacity limits
153
- 5. **Load Balancing:** Auxiliary loss ensures uniform expert utilization
154
-
155
- ### Loss Functions
156
-
157
- The training loss combines three components:
158
-
159
- ```
160
- L_total = L_ce + α * L_aux + β * L_z
161
- ```
162
-
163
- - **L_ce:** Cross-entropy language modeling loss
164
- - **L_aux:** Load balance loss = 0.01) for uniform expert utilization
165
- - **L_z:** Router z-loss = 0.001) for numerical stability
166
-
167
- ### Attention Mechanism
168
-
169
- - **RoPE (Rotary Position Embeddings)** for position encoding
170
- - **PyTorch SDPA** with automatic backend selection (Flash Attention when available)
171
- - **Causal masking** for autoregressive generation
172
-
173
- ### Optimizations
174
-
175
- - ✅ **Gradient Checkpointing:** ~40% VRAM reduction
176
- - **Mixed Precision (BF16):** 2x faster training
177
- - **Weight Tying:** LM head shares embeddings
178
- - **Batch Expert Processing:** Parallel computation for all experts
179
-
180
- ## Limitations and Biases
181
-
182
- - **Language:** Primarily trained on German text
183
- - **Domain:** General domain (Wikipedia, literature, subtitles)
184
- - **Biases:** May reflect biases present in training data
185
- - **Context:** Limited to 2048 tokens
186
- - **Compute:** Requires GPU for efficient inference
187
-
188
- ## Ethical Considerations
189
-
190
- This model is a language model and can generate text that may be:
191
- - Factually incorrect
192
- - Biased or stereotypical
193
- - Inappropriate or offensive
194
-
195
- Users should:
196
- - Verify generated content for factual accuracy
197
- - Be aware of potential biases
198
- - Use appropriate content filtering for production applications
199
-
200
- ## Citation
201
-
202
- If you use this model in your research, please cite:
203
-
204
- ```bibtex
205
- @misc{german-moe-gpt-v8,
206
- title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
207
- author={[Your Name]},
208
- year={2025},
209
- howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
210
- }
211
- ```
212
-
213
- ## References
214
-
215
- This implementation is based on:
216
-
217
- - **ST-MoE:** Zoph et al. (2022) - [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906)
218
- - **Switch Transformer:** Fedus et al. (2022) - [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
219
- - **RoFormer:** Su et al. (2021) - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
220
-
221
- ## License
222
-
223
- MIT License - See LICENSE file for details
224
-
225
- ## Acknowledgments
226
-
227
- - HuggingFace Transformers team for the excellent framework
228
- - PyTorch team for SDPA and optimized operations
229
- - nanoGPT/nanoMoE community for inspiration
230
-
231
- ## Model Card Contact
232
-
233
- For questions or feedback, please open an issue in the [GitHub repository](https://github.com/accemlcc/german-moe-gpt-v8).
 
1
+ ---
2
+ language:
3
+ - de
4
+ license: mit
5
+ library_name: transformers
6
+ tags:
7
+ - text-generation
8
+ - pytorch
9
+ - causal-lm
10
+ - mixture-of-experts
11
+ - moe
12
+ - german
13
+ - gpt
14
+ - language-model
15
+ base_model: []
16
+ pipeline_tag: text-generation
17
+ model-index:
18
+ - name: german-moe-gpt-v8-pretrained
19
+ results: []
20
+ datasets:
21
+ - wikipedia
22
+ widget:
23
+ - text: "Die Hauptstadt von Deutschland ist"
24
+ example_title: "German Capital"
25
+ - text: "Künstliche Intelligenz ist"
26
+ example_title: "AI Definition"
27
+ - text: "Es war einmal"
28
+ example_title: "Story Beginning"
29
+ ---
30
+
31
+ # German MoE GPT v8 - OPUS EDITION
32
+
33
+ A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
34
+
35
+ > **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
36
+
37
+ ## Model Description
38
+
39
+ This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.
40
+
41
+ ### Key Features
42
+
43
+ - 🏗️ **Hybrid Dense + MoE Architecture:** Every 2nd layer uses MoE for efficiency
44
+ - 🔬 **Research-Backed:** Implements ST-MoE and Switch Transformer best practices
45
+ - ⚡ **Efficient:** Only ~33% of parameters active per token
46
+ - 🖥️ **Cross-Platform:** Pure PyTorch, runs on Windows/Linux/macOS
47
+ - 🤗 **HuggingFace Compatible:** Full integration with `transformers` library
48
+
49
+ ## Model Specifications
50
+
51
+ | Specification | Value |
52
+ |--------------|-------|
53
+ | Total Parameters | 149.6M |
54
+ | Active Parameters per Token | ~49.9M (~33%) |
55
+ | Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
56
+ | Context Length | 2048 tokens |
57
+ | Architecture | Hybrid Dense + MoE Transformer |
58
+ | Layers | 12 |
59
+ | Hidden Size | 768 |
60
+ | Attention Heads | 12 |
61
+ | Experts per MoE Layer | 32 |
62
+ | Active Experts (Top-k) | 2 |
63
+ | Position Embeddings | RoPE (Rotary Position Embeddings) |
64
+
65
+ ## Training Data
66
+
67
+ The model was trained on a 17.4 GB curated German corpus consisting of:
68
+
69
+ - **Clean German Wikipedia** (~11 GB): Encyclopedic knowledge
70
+ - **OpenSubtitles (German)**: Natural dialog and conversational language
71
+ - **Belletristik** (fiction/belles-lettres): German literature for style and creativity
72
+
73
+ **Data Quality:** Deduplicated and filtered for SEO spam to preserve a high-quality training signal.
74
+
75
+ > **Adapting to other languages:** The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.
76
+
77
+ ## Training Details
78
+
79
+ ### Training Hyperparameters
80
+
81
+ - **Steps:** 300,000
82
+ - **Batch Size:** 32 (with gradient accumulation)
83
+ - **Learning Rate:** 3e-4 (max)
84
+ - **Hardware:** Single RTX 4090 (24GB VRAM)
85
+ - **Training Time:** ~120 hours
86
+ - **Precision:** Mixed (BF16)
87
+
88
+ ### Results
89
+
90
+ | Metric | Initial | Final | Improvement |
91
+ |--------|---------|-------|-------------|
92
+ | Training Loss | 12.0 | 2.55 | 79% ↓ |
93
+ | Validation Loss | 4.58 | 2.40 | 48% ↓ |
94
+ | Perplexity | - | 11.0 | - |
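+
+ The reported perplexity is simply the exponential of the final validation loss, as a quick check confirms:
+
+ ```python
+ import math
+
+ math.exp(2.40)  # ≈ 11.02, matching the reported perplexity of 11.0
+ ```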
95
+
96
+ ## Usage
97
+
98
+ ### Installation
99
+
100
+ ```bash
101
+ pip install transformers torch
102
+ ```
103
+
104
+ ### Quick Start
105
+
106
+ See `inference.py` in the repository for the complete quick-start script. A minimal usage sketch is shown below.
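+
+ A minimal sketch using the standard `transformers` generation API; `inference.py` remains the authoritative example:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load model and tokenizer from the Hub
+ model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
+ tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
+
+ # Generate a continuation for a German prompt
+ prompt = "Die Hauptstadt von Deutschland ist"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=100,
+     temperature=0.8,
+     top_k=50,
+     top_p=0.9,
+     do_sample=True,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```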
107
+
108
+ ### Advanced Usage
109
+
110
+ ```python
111
+ # Generate with custom parameters
112
+ outputs = model.generate(
113
+ **inputs,
114
+ max_new_tokens=200,
115
+ temperature=0.7, # Lower = more deterministic
116
+ top_k=40, # Top-k sampling
117
+ top_p=0.95, # Nucleus sampling
118
+ repetition_penalty=1.1, # Reduce repetition
119
+ do_sample=True
120
+ )
121
+ ```
122
+
123
+ ## Technical Architecture
124
+
125
+ ### MoE Layer Design
126
+
127
+ The model uses a **Noisy Top-k Router** with the following components (a simplified sketch follows the list):
128
+
129
+ 1. **Gate Computation:** Learned routing weights per expert
130
+ 2. **Noise Injection:** Adds controlled noise during training for exploration
131
+ 3. **Top-k Selection:** Routes each token to the 2 best experts
132
+ 4. **Capacity Management:** Prevents expert overload with dynamic capacity limits
133
+ 5. **Load Balancing:** Auxiliary loss ensures uniform expert utilization
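+
+ A simplified, self-contained sketch of this routing scheme (illustrative only; function and tensor names are assumptions, and capacity management and the balance loss from steps 4-5 are omitted):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def noisy_topk_route(x, gate_w, noise_w, k=2, training=True):
+     """Route each token in x (tokens, hidden) to its k best experts."""
+     logits = x @ gate_w                           # 1. gate computation
+     if training:                                  # 2. noise injection during training
+         noise_std = F.softplus(x @ noise_w)
+         logits = logits + torch.randn_like(logits) * noise_std
+     topk_vals, topk_idx = logits.topk(k, dim=-1)  # 3. top-k selection
+     weights = F.softmax(topk_vals, dim=-1)        # combination weights for the k experts
+     return weights, topk_idx
+
+ # 8 tokens, hidden size 768, 32 experts, top-2 routing
+ x = torch.randn(8, 768)
+ weights, experts = noisy_topk_route(x, torch.randn(768, 32), torch.randn(768, 32))
+ ```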
134
+
135
+ ### Loss Functions
136
+
137
+ The training loss combines three components:
138
+
139
+ ```
140
+ L_total = L_ce + α * L_aux + β * L_z
141
+ ```
142
+
143
+ - **L_ce:** Cross-entropy language modeling loss
144
+ - **L_aux:** Load balance loss (α = 0.01) for uniform expert utilization
145
+ - **L_z:** Router z-loss (β = 0.001) for numerical stability
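+
+ As a compact illustration of how these terms are combined (coefficients as above; the z-loss follows the ST-MoE definition, and the balance term is assumed to be returned by the MoE layers):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def total_loss(logits, targets, aux_loss, router_logits, alpha=0.01, beta=0.001):
+     # L_ce: next-token cross-entropy over the vocabulary
+     l_ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+     # L_z: router z-loss, the mean squared log-sum-exp of the router logits
+     l_z = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
+     # L_aux: load-balance loss, accumulated inside the MoE layers
+     return l_ce + alpha * aux_loss + beta * l_z
+ ```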
146
+
147
+ ### Attention Mechanism
148
+
149
+ - **RoPE (Rotary Position Embeddings)** for position encoding
150
+ - **PyTorch SDPA** with automatic backend selection (Flash Attention when available)
151
+ - **Causal masking** for autoregressive generation
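+
+ The attention call itself reduces to PyTorch's built-in SDPA, sketched below (RoPE is applied to the query and key tensors beforehand and is not shown):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # (batch, heads, seq, head_dim) with 12 heads and 768 / 12 = 64 dims per head
+ q = torch.randn(1, 12, 2048, 64)
+ k, v = torch.randn_like(q), torch.randn_like(q)
+
+ # SDPA picks the fastest available backend (Flash Attention, memory-efficient, or math)
+ out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal masking built in
+ ```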
152
+
153
+ ### Optimizations
154
+
155
+ - ✅ **Gradient Checkpointing:** ~40% VRAM reduction
156
+ - ✅ **Mixed Precision (BF16):** 2x faster training
157
+ - ✅ **Weight Tying:** LM head shares embeddings (see the sketch after this list)
158
+ - ✅ **Batch Expert Processing:** Parallel computation for all experts
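+
+ Weight tying in particular is a one-line construction; a minimal illustration (module names are assumptions):
+
+ ```python
+ import torch.nn as nn
+
+ vocab_size, hidden = 128_256, 768
+ tok_embed = nn.Embedding(vocab_size, hidden)
+ lm_head = nn.Linear(hidden, vocab_size, bias=False)
+
+ # Tie the output projection to the input embedding: one matrix serves both,
+ # saving vocab_size * hidden parameters.
+ lm_head.weight = tok_embed.weight
+ ```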
159
+
160
+ ## Limitations and Biases
161
+
162
+ - **Language:** Primarily trained on German text
163
+ - **Domain:** General domain (Wikipedia, literature, subtitles)
164
+ - **Biases:** May reflect biases present in training data
165
+ - **Context:** Limited to 2048 tokens
166
+ - **Compute:** Requires GPU for efficient inference
167
+
168
+ ## Ethical Considerations
169
+
170
+ Like any language model, this model can generate text that may be:
171
+ - Factually incorrect
172
+ - Biased or stereotypical
173
+ - Inappropriate or offensive
174
+
175
+ Users should:
176
+ - Verify generated content for factual accuracy
177
+ - Be aware of potential biases
178
+ - Use appropriate content filtering for production applications
179
+
180
+ ## Citation
181
+
182
+ If you use this model in your research, please cite:
183
+
184
+ ```bibtex
185
+ @misc{german-moe-gpt-v8,
186
+ title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
187
+ author={[Your Name]},
188
+ year={2025},
189
+ howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
190
+ }
191
+ ```
192
+
193
+ ## References
194
+
195
+ This implementation is based on:
196
+
197
+ - **ST-MoE:** Zoph et al. (2022) - [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906)
198
+ - **Switch Transformer:** Fedus et al. (2022) - [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
199
+ - **RoFormer:** Su et al. (2021) - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
200
+
201
+ ## License
202
+
203
+ MIT License - See LICENSE file for details
204
+
205
+ ## Acknowledgments
206
+
207
+ - HuggingFace Transformers team for the excellent framework
208
+ - PyTorch team for SDPA and optimized operations
209
+ - nanoGPT/nanoMoE community for inspiration
210
+
211
+ ## Model Card Contact
212
+
213
+ For questions or feedback, please open an issue in the [GitHub repository](https://github.com/accemlcc/german-moe-gpt-v8).