GPT-2 450M FineWebEdu & SmolTalk Training Progress
FineWebEdu Pre-Training
| Step | Train PPL | Avg Loss | Current Loss | Notes / Key Events | 
|---|---|---|---|---|
| ~0–10 k | ~100–150 | 4.6–5.0 | Varies | Initial warm-up, large fluctuations | 
| 20 k | 70 | 4.25 | Varies | First solid convergence milestone | 
| 24 k | 63 | 4.14 | Varies | Smooth training, steady drop | 
| 26 k | 60 | 4.09 | Varies | Stable regime, lower variance | 
| 37 k | 52.97 | 3.970 | 3.818 | Post-resume from 36 k checkpoint | 
| 38 k | 64.96 | 4.174 | 4.119 | Spike (hard batch / buffer shuffle) | 
| 39 k | 55.24 | 4.012 | 3.678 | Recovery from spike | 
| 40 k | 53.61 | 3.982 | 4.230 | HF safetensors push checkpoint | 
| 41 k | 49.98 | 3.912 | 4.163 | Broke below 50 PPL | 
| 42 k | 54.33 | 3.995 | 4.313 | Slight fluctuation | 
| 43 k | 51.27 | 3.937 | 3.925 | Stabilizing phase | 
| 44 k | 50.74 | 3.927 | 3.894 | Smooth training | 
| 45 k | 51.12 | 3.934 | 3.744 | Minor plateau | 
| 46 k | 53.87 | 3.987 | 4.145 | Batch variance | 
| 47 k | 52.39 | 3.959 | 4.092 | Mid-range phase | 
| 48 k | 43.85 | 3.781 | 4.038 | Best transient PPL drop so far | 
| 49 k | 48.94 | 3.891 | 3.780 | Rebound stabilization | 
| 50 k | 44.37 | 3.793 | 3.821 | HF milestone push (pre-resume) | 
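
The Train PPL and Avg Loss columns appear to be related by PPL = exp(loss), which is the usual definition when the loss is a mean per-token cross-entropy in nats. The sketch below checks that relationship against a few rows copied from the table above. The `perplexity` helper is illustrative and is not taken from the training code; small differences from the logged values are expected because the loss is logged to three decimals.

```python
import math


def perplexity(avg_loss: float) -> float:
    """Perplexity for a mean per-token cross-entropy loss given in nats."""
    return math.exp(avg_loss)


# (step, logged Train PPL, logged Avg Loss), copied from the table above.
rows = [
    ("37 k", 52.97, 3.970),
    ("38 k", 64.96, 4.174),
    ("50 k", 44.37, 3.793),
]

for step, logged_ppl, avg_loss in rows:
    recomputed = perplexity(avg_loss)
    print(f"{step}: exp({avg_loss}) = {recomputed:.2f} (logged {logged_ppl})")
```

The same conversion works in reverse: `math.log(ppl)` recovers the average loss, which is handy for comparing runs that log only one of the two.
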
SmolTalk Fine-Tuning (50% Conversation / 50% Instruction)
| Step | Train PPL | Avg Loss | Current Loss | Notes / Key Events | 
|---|---|---|---|---|
| 0–0.5k | 13.38 | 2.5937 | 2.2737 | Initial SmolTalk fine-tuning, Mix: Conv 46.6%, Instruct 53.4%; checkpoint saved at step 500 | 
| 0.5k–1k | 9.89 | 2.2916 | 2.2337 | Eval at step 1000: Eval PPL 9.31; Mix: Conv 46.9%, Instruct 53.1%; checkpoint saved at step 1000 | 
| 1k–1.5k | 9.96 | 2.2989 | 1.8901 | Stable training, Mix: Conv 47.8%, Instruct 52.2%; checkpoint saved at step 1500 | 
| 1.5k–2k | 8.32 | 2.1190 | 1.8197 | SmolTalk mix balanced, Mix: Conv 48.9%, Instruct 51.1%; checkpoint saved at step 2000 | 
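
The Mix values in the table above sit close to, but not exactly at, the 50/50 target. The data loader is not shown here, so the following is only a minimal sketch of one way such a mix could be drawn and its realized proportions tracked, assuming each example is picked independently at random from the two sources. The function name `sample_mix` and its parameters are hypothetical.

```python
import random


def sample_mix(num_examples: int, p_conversation: float = 0.5, seed: int = 0) -> dict:
    """Draw a source for each example and return the realized proportions.

    Assumption: independent per-example sampling; the real loader may differ.
    """
    rng = random.Random(seed)
    counts = {"conversation": 0, "instruction": 0}
    for _ in range(num_examples):
        source = "conversation" if rng.random() < p_conversation else "instruction"
        counts[source] += 1
    return {name: count / num_examples for name, count in counts.items()}


# Realized proportions drift toward the target as more examples are drawn.
for n in (100, 1_000, 10_000):
    mix = sample_mix(n)
    print(f"{n} examples: Conv {mix['conversation']:.1%}, Instruct {mix['instruction']:.1%}")
```

Under this assumption the realized split only approaches the target as the number of sampled examples grows, which is one possible reading of the Mix column moving from 46.6% toward 48.9% conversation data.
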
Evaluation at 50k
| Step | Train PPL | Avg Loss | Current Loss | Eval PPL | HellaSwag | Notes / Key Events | 
|---|---|---|---|---|---|---|
| 50k | 44.37 | 3.793 | 3.821 | 45.68 | 31 | HF milestone push / pre-resume evaluation | 
