arnomatic committed (verified)
Commit 82756bb · 1 Parent(s): 3479481

Update README.md

Files changed (1)
  1. README.md +213 -233
README.md CHANGED
@@ -1,233 +1,213 @@
1
- ---
2
- language:
3
- - de
4
- license: mit
5
- library_name: transformers
6
- tags:
7
- - text-generation
8
- - pytorch
9
- - causal-lm
10
- - mixture-of-experts
11
- - moe
12
- - german
13
- - gpt
14
- - language-model
15
- base_model: []
16
- pipeline_tag: text-generation
17
- model-index:
18
- - name: german-moe-gpt-v8-pretrained
19
- results: []
20
- datasets:
21
- - wikipedia
22
- widget:
23
- - text: "Die Hauptstadt von Deutschland ist"
24
- example_title: "German Capital"
25
- - text: "Künstliche Intelligenz ist"
26
- example_title: "AI Definition"
27
- - text: "Es war einmal"
28
- example_title: "Story Beginning"
29
- ---
30
-
31
- # German MoE GPT v8 - OPUS EDITION
32
-
33
- A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
34
-
35
- > **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
36
-
37
- ## Model Description
38
-
39
- This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.
40
-
41
- ### Key Features
42
-
43
- - 🏗️ **Hybrid Dense + MoE Architecture:** Every 2nd layer uses MoE for efficiency
44
- - 🔬 **Research-Backed:** Implements ST-MoE and Switch Transformer best practices
45
- - ⚡ **Efficient:** Only ~33% of parameters active per token
46
- - 🖥️ **Cross-Platform:** Pure PyTorch, runs on Windows/Linux/macOS
47
- - 🤗 **HuggingFace Compatible:** Full integration with `transformers` library
48
-
49
- ## Model Specifications
50
-
51
- | Specification | Value |
52
- |--------------|-------|
53
- | Total Parameters | 149.6M |
54
- | Active Parameters per Token | ~49.9M (~33%) |
55
- | Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
56
- | Context Length | 2048 tokens |
57
- | Architecture | Hybrid Dense + MoE Transformer |
58
- | Layers | 12 |
59
- | Hidden Size | 768 |
60
- | Attention Heads | 12 |
61
- | Experts per MoE Layer | 32 |
62
- | Active Experts (Top-k) | 2 |
63
- | Position Embeddings | RoPE (Rotary Position Embeddings) |
64
-
65
- ## Training Data
66
-
67
- The model was trained on a 17.4 GB curated German corpus consisting of:
68
-
69
- - **Clean German Wikipedia** (~11 GB): Encyclopedic knowledge
70
- - **OpenSubtitles (German)**: Natural dialog and conversational language
71
- - **Belletristik**: German literature for style and creativity
72
-
73
- **Data Quality:** Deduplicated and SEO spam filtered for high-quality training signal.
74
-
75
- > **Adapting to other languages:** The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.
76
-
77
- ## Training Details
78
-
79
- ### Training Hyperparameters
80
-
81
- - **Steps:** 300,000
82
- - **Batch Size:** 32 (with gradient accumulation)
83
- - **Learning Rate:** 3e-4 (max)
84
- - **Hardware:** Single RTX 4090 (24GB VRAM)
85
- - **Training Time:** ~120 hours
86
- - **Precision:** Mixed (BF16)
87
-
88
- ### Results
89
-
90
- | Metric | Initial | Final | Improvement |
91
- |--------|---------|-------|-------------|
92
- | Training Loss | 12.0 | 2.55 | 79% ↓ |
93
- | Validation Loss | 4.58 | 2.40 | 48% ↓ |
94
- | Perplexity | - | 11.0 | - |
95
-
96
- ## Usage
97
-
98
- ### Installation
99
-
100
- ```bash
101
- pip install transformers torch
102
- ```
103
-
104
- ### Quick Start
105
-
106
- ```python
107
- from transformers import AutoTokenizer, AutoModelForCausalLM
108
-
109
- # Load model and tokenizer
110
- model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
111
- tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
112
-
113
- # Generate text
114
- prompt = "Die Hauptstadt von Deutschland ist"
115
- inputs = tokenizer(prompt, return_tensors="pt")
116
- outputs = model.generate(
117
- **inputs,
118
- max_new_tokens=100,
119
- temperature=0.8,
120
- top_k=50,
121
- top_p=0.9,
122
- do_sample=True
123
- )
124
-
125
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
126
- ```
127
-
128
- ### Advanced Usage
129
-
130
- ```python
131
- # Generate with custom parameters
132
- outputs = model.generate(
133
- **inputs,
134
- max_new_tokens=200,
135
- temperature=0.7, # Lower = more deterministic
136
- top_k=40, # Top-k sampling
137
- top_p=0.95, # Nucleus sampling
138
- repetition_penalty=1.1, # Reduce repetition
139
- do_sample=True
140
- )
141
- ```
142
-
143
- ## Technical Architecture
144
-
145
- ### MoE Layer Design
146
-
147
- The model uses a **Noisy Top-k Router** with the following components:
148
-
149
- 1. **Gate Computation:** Learned routing weights per expert
150
- 2. **Noise Injection:** Adds controlled noise during training for exploration
151
- 3. **Top-k Selection:** Routes each token to the 2 best experts
152
- 4. **Capacity Management:** Prevents expert overload with dynamic capacity limits
153
- 5. **Load Balancing:** Auxiliary loss ensures uniform expert utilization
154
-
155
- ### Loss Functions
156
-
157
- The training loss combines three components:
158
-
159
- ```
160
- L_total = L_ce + α * L_aux + β * L_z
161
- ```
162
-
163
- - **L_ce:** Cross-entropy language modeling loss
164
- - **L_aux:** Load balance loss = 0.01) for uniform expert utilization
165
- - **L_z:** Router z-loss = 0.001) for numerical stability
166
-
167
- ### Attention Mechanism
168
-
169
- - **RoPE (Rotary Position Embeddings)** for position encoding
170
- - **PyTorch SDPA** with automatic backend selection (Flash Attention when available)
171
- - **Causal masking** for autoregressive generation
172
-
173
- ### Optimizations
174
-
175
- - ✅ **Gradient Checkpointing:** ~40% VRAM reduction
176
- - **Mixed Precision (BF16):** 2x faster training
177
- - **Weight Tying:** LM head shares embeddings
178
- - **Batch Expert Processing:** Parallel computation for all experts
179
-
180
- ## Limitations and Biases
181
-
182
- - **Language:** Primarily trained on German text
183
- - **Domain:** General domain (Wikipedia, literature, subtitles)
184
- - **Biases:** May reflect biases present in training data
185
- - **Context:** Limited to 2048 tokens
186
- - **Compute:** Requires GPU for efficient inference
187
-
188
- ## Ethical Considerations
189
-
190
- This model is a language model and can generate text that may be:
191
- - Factually incorrect
192
- - Biased or stereotypical
193
- - Inappropriate or offensive
194
-
195
- Users should:
196
- - Verify generated content for factual accuracy
197
- - Be aware of potential biases
198
- - Use appropriate content filtering for production applications
199
-
200
- ## Citation
201
-
202
- If you use this model in your research, please cite:
203
-
204
- ```bibtex
205
- @misc{german-moe-gpt-v8,
206
- title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
207
- author={[Your Name]},
208
- year={2025},
209
- howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
210
- }
211
- ```
212
-
213
- ## References
214
-
215
- This implementation is based on:
216
-
217
- - **ST-MoE:** Zoph et al. (2022) - [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906)
218
- - **Switch Transformer:** Fedus et al. (2022) - [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
219
- - **RoFormer:** Su et al. (2021) - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
220
-
221
- ## License
222
-
223
- MIT License - See LICENSE file for details
224
-
225
- ## Acknowledgments
226
-
227
- - HuggingFace Transformers team for the excellent framework
228
- - PyTorch team for SDPA and optimized operations
229
- - nanoGPT/nanoMoE community for inspiration
230
-
231
- ## Model Card Contact
232
-
233
- For questions or feedback, please open an issue in the [GitHub repository](https://github.com/accemlcc/german-moe-gpt-v8).
 
1
+ ---
2
+ language:
3
+ - de
4
+ license: mit
5
+ library_name: transformers
6
+ tags:
7
+ - text-generation
8
+ - pytorch
9
+ - causal-lm
10
+ - mixture-of-experts
11
+ - moe
12
+ - german
13
+ - gpt
14
+ - language-model
15
+ base_model: []
16
+ pipeline_tag: text-generation
17
+ model-index:
18
+ - name: german-moe-gpt-v8-pretrained
19
+ results: []
20
+ datasets:
21
+ - wikipedia
22
+ widget:
23
+ - text: "Die Hauptstadt von Deutschland ist"
24
+ example_title: "German Capital"
25
+ - text: "Künstliche Intelligenz ist"
26
+ example_title: "AI Definition"
27
+ - text: "Es war einmal"
28
+ example_title: "Story Beginning"
29
+ ---
30
+
31
+ # German MoE GPT v8 - OPUS EDITION
32
+
33
+ A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.
34
+
35
+ > **Note:** While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.
36
+
37
+ ## Model Description
38
+
39
+ This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.
40
+
41
+ ### Key Features
42
+
43
+ - 🏗️ **Hybrid Dense + MoE Architecture:** Every 2nd layer uses MoE for efficiency
44
+ - 🔬 **Research-Backed:** Implements ST-MoE and Switch Transformer best practices
45
+ - ⚡ **Efficient:** Only ~33% of parameters active per token
46
+ - 🖥️ **Cross-Platform:** Pure PyTorch, runs on Windows/Linux/macOS
47
+ - 🤗 **HuggingFace Compatible:** Full integration with `transformers` library
48
+
49
+ ## Model Specifications
50
+
51
+ | Specification | Value |
52
+ |--------------|-------|
53
+ | Total Parameters | 149.6M |
54
+ | Active Parameters per Token | ~49.9M (~33%) |
55
+ | Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
56
+ | Context Length | 2048 tokens |
57
+ | Architecture | Hybrid Dense + MoE Transformer |
58
+ | Layers | 12 |
59
+ | Hidden Size | 768 |
60
+ | Attention Heads | 12 |
61
+ | Experts per MoE Layer | 32 |
62
+ | Active Experts (Top-k) | 2 |
63
+ | Position Embeddings | RoPE (Rotary Position Embeddings) |
64
+
65
+ ## Training Data
66
+
67
+ The model was trained on a 17.4 GB curated German corpus consisting of:
68
+
69
+ - **Clean German Wikipedia** (~11 GB): Encyclopedic knowledge
70
+ - **OpenSubtitles (German)**: Natural dialog and conversational language
71
+ - **Belletristik** (fiction/belles-lettres): German literature for style and creativity
72
+
73
+ **Data Quality:** Deduplicated and filtered for SEO spam to preserve a high-quality training signal.
74
+
75
+ > **Adapting to other languages:** The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.
76
+
77
+ ## Training Details
78
+
79
+ ### Training Hyperparameters
80
+
81
+ - **Steps:** 300,000
82
+ - **Batch Size:** 32 (with gradient accumulation)
83
+ - **Learning Rate:** 3e-4 (max)
84
+ - **Hardware:** Single RTX 4090 (24GB VRAM)
85
+ - **Training Time:** ~120 hours
86
+ - **Precision:** Mixed (BF16)
87
+
88
+ ### Results
89
+
90
+ | Metric | Initial | Final | Improvement |
91
+ |--------|---------|-------|-------------|
92
+ | Training Loss | 12.0 | 2.55 | 79% ↓ |
93
+ | Validation Loss | 4.58 | 2.40 | 48% ↓ |
94
+ | Perplexity | - | 11.0 | - |
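+
+ The reported perplexity is simply the exponential of the final validation loss, as a quick check confirms:
+
+ ```python
+ import math
+
+ math.exp(2.40)  # ≈ 11.02, matching the reported perplexity of 11.0
+ ```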
95
+
96
+ ## Usage
97
+
98
+ ### Installation
99
+
100
+ ```bash
101
+ pip install transformers torch
102
+ ```
103
+
104
+ ### Quick Start
105
+
106
+ See `inference.py` in the repository for the complete quick-start script. A minimal usage sketch is shown below.
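+
+ A minimal sketch using the standard `transformers` generation API; `inference.py` remains the authoritative example:
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load model and tokenizer from the Hub
+ model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
+ tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
+
+ # Generate a continuation for a German prompt
+ prompt = "Die Hauptstadt von Deutschland ist"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=100,
+     temperature=0.8,
+     top_k=50,
+     top_p=0.9,
+     do_sample=True,
+ )
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```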
107
+
108
+ ### Advanced Usage
109
+
110
+ ```python
111
+ # Generate with custom parameters
112
+ outputs = model.generate(
113
+ **inputs,
114
+ max_new_tokens=200,
115
+ temperature=0.7, # Lower = more deterministic
116
+ top_k=40, # Top-k sampling
117
+ top_p=0.95, # Nucleus sampling
118
+ repetition_penalty=1.1, # Reduce repetition
119
+ do_sample=True
120
+ )
121
+ ```
122
+
123
+ ## Technical Architecture
124
+
125
+ ### MoE Layer Design
126
+
127
+ The model uses a **Noisy Top-k Router** with the following components (a simplified sketch follows the list):
128
+
129
+ 1. **Gate Computation:** Learned routing weights per expert
130
+ 2. **Noise Injection:** Adds controlled noise during training for exploration
131
+ 3. **Top-k Selection:** Routes each token to the 2 best experts
132
+ 4. **Capacity Management:** Prevents expert overload with dynamic capacity limits
133
+ 5. **Load Balancing:** Auxiliary loss ensures uniform expert utilization
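+
+ A simplified, self-contained sketch of this routing scheme (illustrative only; function and tensor names are assumptions, and capacity management and the balance loss from steps 4-5 are omitted):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def noisy_topk_route(x, gate_w, noise_w, k=2, training=True):
+     """Route each token in x (tokens, hidden) to its k best experts."""
+     logits = x @ gate_w                           # 1. gate computation
+     if training:                                  # 2. noise injection during training
+         noise_std = F.softplus(x @ noise_w)
+         logits = logits + torch.randn_like(logits) * noise_std
+     topk_vals, topk_idx = logits.topk(k, dim=-1)  # 3. top-k selection
+     weights = F.softmax(topk_vals, dim=-1)        # combination weights for the k experts
+     return weights, topk_idx
+
+ # 8 tokens, hidden size 768, 32 experts, top-2 routing
+ x = torch.randn(8, 768)
+ weights, experts = noisy_topk_route(x, torch.randn(768, 32), torch.randn(768, 32))
+ ```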
134
+
135
+ ### Loss Functions
136
+
137
+ The training loss combines three components:
138
+
139
+ ```
140
+ L_total = L_ce + α * L_aux + β * L_z
141
+ ```
142
+
143
+ - **L_ce:** Cross-entropy language modeling loss
144
+ - **L_aux:** Load balance loss (α = 0.01) for uniform expert utilization
145
+ - **L_z:** Router z-loss (β = 0.001) for numerical stability
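+
+ As a compact illustration of how these terms are combined (coefficients as above; the z-loss follows the ST-MoE definition, and the balance term is assumed to be returned by the MoE layers):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def total_loss(logits, targets, aux_loss, router_logits, alpha=0.01, beta=0.001):
+     # L_ce: next-token cross-entropy over the vocabulary
+     l_ce = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
+     # L_z: router z-loss, the mean squared log-sum-exp of the router logits
+     l_z = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
+     # L_aux: load-balance loss, accumulated inside the MoE layers
+     return l_ce + alpha * aux_loss + beta * l_z
+ ```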
146
+
147
+ ### Attention Mechanism
148
+
149
+ - **RoPE (Rotary Position Embeddings)** for position encoding
150
+ - **PyTorch SDPA** with automatic backend selection (Flash Attention when available)
151
+ - **Causal masking** for autoregressive generation
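+
+ The attention call itself reduces to PyTorch's built-in SDPA, sketched below (RoPE is applied to the query and key tensors beforehand and is not shown):
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ # (batch, heads, seq, head_dim) with 12 heads and 768 / 12 = 64 dims per head
+ q = torch.randn(1, 12, 2048, 64)
+ k, v = torch.randn_like(q), torch.randn_like(q)
+
+ # SDPA picks the fastest available backend (Flash Attention, memory-efficient, or math)
+ out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # causal masking built in
+ ```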
152
+
153
+ ### Optimizations
154
+
155
+ - ✅ **Gradient Checkpointing:** ~40% VRAM reduction
156
+ - ✅ **Mixed Precision (BF16):** 2x faster training
157
+ - ✅ **Weight Tying:** LM head shares embeddings (see the sketch after this list)
158
+ - ✅ **Batch Expert Processing:** Parallel computation for all experts
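+
+ Weight tying in particular is a one-line construction; a minimal illustration (module names are assumptions):
+
+ ```python
+ import torch.nn as nn
+
+ vocab_size, hidden = 128_256, 768
+ tok_embed = nn.Embedding(vocab_size, hidden)
+ lm_head = nn.Linear(hidden, vocab_size, bias=False)
+
+ # Tie the output projection to the input embedding: one matrix serves both,
+ # saving vocab_size * hidden parameters.
+ lm_head.weight = tok_embed.weight
+ ```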
159
+
160
+ ## Limitations and Biases
161
+
162
+ - **Language:** Primarily trained on German text
163
+ - **Domain:** General domain (Wikipedia, literature, subtitles)
164
+ - **Biases:** May reflect biases present in training data
165
+ - **Context:** Limited to 2048 tokens
166
+ - **Compute:** Requires GPU for efficient inference
167
+
168
+ ## Ethical Considerations
169
+
170
+ Like any language model, this model can generate text that may be:
171
+ - Factually incorrect
172
+ - Biased or stereotypical
173
+ - Inappropriate or offensive
174
+
175
+ Users should:
176
+ - Verify generated content for factual accuracy
177
+ - Be aware of potential biases
178
+ - Use appropriate content filtering for production applications
179
+
180
+ ## Citation
181
+
182
+ If you use this model in your research, please cite:
183
+
184
+ ```bibtex
185
+ @misc{german-moe-gpt-v8,
186
+ title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
187
+ author={[Your Name]},
188
+ year={2025},
189
+ howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
190
+ }
191
+ ```
192
+
193
+ ## References
194
+
195
+ This implementation is based on:
196
+
197
+ - **ST-MoE:** Zoph et al. (2022) - [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906)
198
+ - **Switch Transformer:** Fedus et al. (2022) - [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
199
+ - **RoFormer:** Su et al. (2021) - [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
200
+
201
+ ## License
202
+
203
+ MIT License - See LICENSE file for details
204
+
205
+ ## Acknowledgments
206
+
207
+ - HuggingFace Transformers team for the excellent framework
208
+ - PyTorch team for SDPA and optimized operations
209
+ - nanoGPT/nanoMoE community for inspiration
210
+
211
+ ## Model Card Contact
212
+
213
+ For questions or feedback, please open an issue in the [GitHub repository](https://github.com/accemlcc/german-moe-gpt-v8).