End of training

Files changed:

- README.md (+17 -38)
- benchmarks.shelve.bak (+1 -0)
- benchmarks.shelve.dat (+0 -0)
- benchmarks.shelve.dir (+1 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee (+3 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee (+3 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee (+3 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee (+3 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333565.1c1a426a2fee (+3 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee (+3 -0)
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee (+3 -0)
- tokenizer.json (+2 -14)
README.md CHANGED

@@ -1,7 +1,7 @@
 ---
 base_model: HuggingFaceTB/SmolLM-135M
 datasets:
-- HuggingFaceFW/fineweb
+- HuggingFaceFW/fineweb-edu
 library_name: Distily
 license: creativeml-openrail-m
 tags:
@@ -18,7 +18,7 @@ model-index:
 
 Distilled with [Distily](https://github.com/lapp0/distily) library
 using teacher model [HuggingFaceTB/SmolLM-135M](https://huggingface.co/HuggingFaceTB/SmolLM-135M)
-on dataset [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb).
+on dataset [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu).
 
 <!-- This model card has been generated automatically according to the information the Trainer had access to. You
 should probably proofread and complete it, then remove this comment.
@@ -80,20 +80,21 @@ LlamaForCausalLM(
 - student 2: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8`
 - student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
 - student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
-
-| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 |
-| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
-| tinyArc.acc_norm,none | 0.37 | **0.303** | 0.295 | 0.302 | 0.26 | 0.269 |
-| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 |
-| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
-| tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 |
-| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 |
-| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 |
-| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 |
+- student 5: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8`
+
+| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 | student 5 |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| tinyArc.acc_norm,none | 0.37 | 0.303 | 0.295 | 0.302 | 0.26 | 0.269 | **0.319** |
+| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 | 0.012 |
+| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+| tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 | 0.301 |
+| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 | 0.292 |
+| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 | 0.427 |
+| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 | 0.417 |
 
 # Resource Usage
 
-- Max Train VRAM Use: 13.
+- Max Train VRAM Use: 13.1273 GB
 - Available VRAM: 23.4329 GB
 - GPUs:
   - 1x NVIDIA GeForce RTX 4090
@@ -123,28 +124,6 @@ LlamaForCausalLM(
     (self_attn): LlamaSdpaAttention(
       (q_proj): Linear(in_features=576, out_features=576, bias=False)
       (k_proj): Linear(in_features=576, out_features=192, bias=False)
-@@ -10,17 +10,16 @@
-      (o_proj): Linear(in_features=576, out_features=576, bias=False)
-      (rotary_emb): LlamaRotaryEmbedding()
-    )
--   (mlp): LlamaMLP(
-+   (mlp): LigerSwiGLUMLP(
-      (gate_proj): Linear(in_features=576, out_features=1536, bias=False)
-      (up_proj): Linear(in_features=576, out_features=1536, bias=False)
-      (down_proj): Linear(in_features=1536, out_features=576, bias=False)
--     (act_fn): SiLU()
-    )
--   (input_layernorm): LlamaRMSNorm((576,), eps=1e-05)
--   (post_attention_layernorm): LlamaRMSNorm((576,), eps=1e-05)
-+   (input_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
-+   (post_attention_layernorm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
-   )
- )
--  (norm): LlamaRMSNorm((576,), eps=1e-05)
-+  (norm): LigerRMSNorm((576,), eps=1e-05, offset=0.0)
-  (rotary_emb): LlamaRotaryEmbedding()
- )
- (lm_head): Linear(in_features=576, out_features=49152, bias=False)
 
 ```
 
@@ -152,7 +131,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on …
+Trained on 640,425,804 tokens from the [HuggingFaceFW/fineweb-edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu) dataset.
 
 - Num Samples: `998,000`
 - Subset: `sample-10BT`
@@ -202,7 +181,7 @@ The following hyperparameters were used during training:
     weight=0
   )
 )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at …>`
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d824cbaf4f0>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
@@ -213,7 +192,7 @@ The following hyperparameters were used during training:
 - teacher_model_name_or_path: `HuggingFaceTB/SmolLM-135M`
 - teacher_load_in_8bit: `False`
 - teacher_load_in_4bit: `False`
-- dataset_uri: `HuggingFaceFW/fineweb`
+- dataset_uri: `HuggingFaceFW/fineweb-edu`
 - dataset_subset: `sample-10BT`
 - dataset_split: `train`
 - dataset_column_name: `text`
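The metric columns in the README table above are tinyBenchmarks tasks as registered in EleutherAI's lm-evaluation-harness. As a hedged sketch (not part of this commit), results in that shape could be regenerated roughly as follows; the model repo id below is a placeholder assumption:

```python
# Hedged sketch: re-running the tinyBenchmarks metrics shown in the README
# table with lm-evaluation-harness. Task names are the tinyBenchmarks tasks
# registered in lm-eval; the pretrained repo id is a placeholder, not a
# value taken from this commit.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=distily/distily_smollm_dataset_sweep",  # hypothetical
    tasks=["tinyArc", "tinyGSM8k", "tinyHellaswag",
           "tinyMMLU", "tinyTruthfulQA", "tinyWinogrande"],
)

# Print metric values in the same "task.metric,filter" style as the table.
for task, metrics in results["results"].items():
    for key, value in metrics.items():
        if isinstance(value, float):
            print(f"{task}.{key}: {value:.3f}")
```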
benchmarks.shelve.bak CHANGED

@@ -4,3 +4,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
benchmarks.shelve.dat CHANGED

Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
benchmarks.shelve.dir CHANGED

@@ -4,3 +4,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8', (3072, 448)
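The benchmarks.shelve.bak/.dat/.dir trio is the on-disk layout of Python's dbm.dumb backend, which shelve falls back to: each line in the .dir index (mirrored in .bak) maps a key to a (position, length) pair inside the .dat data file, exactly the entries visible in the diff above. A minimal sketch of reading the store (the key is one that appears in this commit; the shape of the stored value is not visible here, so it is printed as-is):

```python
# Hedged sketch: reading the benchmarks shelve store. shelve.open() takes
# the base name and locates the .bak/.dat/.dir files written by dbm.dumb.
import shelve

with shelve.open("benchmarks.shelve") as db:
    key = ("distily_smollm_dataset_sweep/logs/"
           "dataset_max_seq_length=1024, dataset_sample_size=1000000, "
           "dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, "
           "learning_rate=6e-05, per_device_train_batch_size=8")
    # The structure of the stored benchmark record is not shown in this
    # commit, so no particular schema is assumed.
    print(db[key])
```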
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d78b57ac043ee94d05e8c1ba184e929678593bf39dee76cc173adacd4357a137
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:950a2485764d9a8707289ae5e36dcd0f106bad33b5437d5e88753778f1282ab5
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:46f4f0f49ae412d50e473e16ee9ba0d9c9ffba01a96132b9da302e8ed89e83ba
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:61f8d3c58bc2c445f6add695c17231a1d6aa44f075e314f683f07998d6e7603b
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333565.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3950b20d235aab15fd629f63779e509d8fa68a67d64198ccde410d368bab2fa5
+size 529

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:070cb12f348bade560a253ed036f705331430f20e2c74309b499f372eb402607
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:9c61cd09949915a47af4cef46db34250f1ba2e1f1a56dc7b5fa1cc44f21a1eb0
+size 562
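Each added log file above is a Git LFS pointer stub, not the TensorBoard event data itself: three "key value" lines recording the spec version, the SHA-256 of the real file, and its size in bytes (the real content is fetched with `git lfs pull`). A small illustrative sketch of parsing such a pointer; the helper name is hypothetical and not part of this repo:

```python
# Hedged sketch: parsing a Git LFS pointer file like the ones added under
# logs/. parse_lfs_pointer is an illustrative helper, not part of this repo.
from pathlib import Path

def parse_lfs_pointer(path: Path) -> dict:
    """Read one 'key value' pair per line of an LFS pointer file."""
    fields = {}
    for line in path.read_text().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = parse_lfs_pointer(Path(
    "logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, "
    "dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, "
    "per_device_train_batch_size=8/events.out.tfevents.1727333857.1c1a426a2fee"
))
print(pointer["oid"], pointer["size"])  # e.g. sha256:d78b57ac… 562
```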
tokenizer.json CHANGED

@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation": {
-    "direction": "Right",
-    "max_length": 1023,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,