From-scratch pretraining on English-only data: no synthetic data, no code, 3 epochs over 1 GB of data for a ~135M-parameter model.
Test network using the Differential Transformer (differential attention). Aside from some changes to the attention, such as 16 heads instead of 9 and the use of differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct.
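For readers unfamiliar with differential attention, the sketch below shows the core idea from the Differential Transformer paper: each head computes two softmax attention maps and subtracts one from the other with a learnable λ, cancelling common-mode attention noise. The dimensions, λ initialization, and normalization details are illustrative assumptions, not this checkpoint's exact configuration; see `test_train.py` for the real settings.

```python
import math
import torch
import torch.nn as nn


class DifferentialAttention(nn.Module):
    """Minimal sketch of one differential-attention block (Ye et al., 2024).

    d_model, n_heads, and lambda_init are placeholders, not the values used
    to train this checkpoint.
    """

    def __init__(self, d_model: int = 576, n_heads: int = 16, lambda_init: float = 0.8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.lambda_init = lambda_init

        # Q and K are projected twice as wide so each head gets two (q, k) pairs.
        self.q_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

        # Learnable reparameterization of lambda, as in the paper.
        self.lambda_q1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k1 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_q2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)
        self.lambda_k2 = nn.Parameter(torch.randn(self.head_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        H, D = self.n_heads, self.head_dim

        # Reshape to (B, H, T, D); Q and K each carry two sub-projections.
        q1, q2 = self.q_proj(x).view(B, T, 2, H, D).permute(2, 0, 3, 1, 4)
        k1, k2 = self.k_proj(x).view(B, T, 2, H, D).permute(2, 0, 3, 1, 4)
        v = self.v_proj(x).view(B, T, H, D).transpose(1, 2)

        # Two independent causal softmax attention maps.
        mask = torch.triu(torch.ones(T, T, device=x.device), 1).bool()
        scale = 1.0 / math.sqrt(D)
        a1 = ((q1 @ k1.transpose(-2, -1)) * scale).masked_fill(mask, float("-inf")).softmax(-1)
        a2 = ((q2 @ k2.transpose(-2, -1)) * scale).masked_fill(mask, float("-inf")).softmax(-1)

        # lambda = exp(lq1 . lk1) - exp(lq2 . lk2) + lambda_init
        lam = (torch.exp(self.lambda_q1 @ self.lambda_k1)
               - torch.exp(self.lambda_q2 @ self.lambda_k2)
               + self.lambda_init)

        # Differential attention: subtract the second map to cancel noise,
        # then apply a per-head RMS normalization and a fixed rescale.
        out = (a1 - lam * a2) @ v
        out = out * torch.rsqrt(out.pow(2).mean(-1, keepdim=True) + 1e-6)
        out = out * (1.0 - self.lambda_init)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, C))
```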
Scripts:
- `inference.py` runs the model with some test prompts.
- `test_train.py` runs with the exact configuration used to train this model and is the reproduction script. Data is assumed to be in JSONL format, one `{"text": "..."}` object per line (see the sketch after this list).
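A minimal sketch of that data layout and a loader, assuming one JSON object with a `"text"` field per line; the file path and helper name are illustrative, not part of this repo.

```python
import json

# Each line of the JSONL file is one JSON object with a "text" field, e.g.:
#   {"text": "example text"}
#   {"text": "another document ..."}

def iter_texts(path: str = "data/train.jsonl"):  # hypothetical path
    """Yield the raw text of each training example."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)["text"]
```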
Notes:
Appears to be very competent: it learned significantly faster than the GQA control and achieved a slightly better (lower) minimum loss. Runtime at this scale is roughly on par with the GQA/MHA control.
Training Metrics
Dataset Information
- Training data per epoch: 1 GB
- Total tokens trained: 48,261,120
- No synthetic data
 
Training Results
- Final Train Loss: 2.8485
- Final Train Perplexity: 17.15
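For reference, perplexity is the exponential of the mean cross-entropy loss (in nats), so the two numbers above are consistent up to rounding and averaging details: exp(2.8485) ≈ 17.26, and small gaps like the reported 17.15 can come from averaging per-batch perplexities rather than exponentiating the mean loss.

```python
import math

# Perplexity is exp(mean cross-entropy loss in nats).
final_train_loss = 2.8485
print(math.exp(final_train_loss))  # ~17.26, close to the reported 17.15
```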
 