End of training

- README.md +18 -12
- benchmarks.shelve.bak +1 -0
- benchmarks.shelve.dat +0 -0 (binary)
- benchmarks.shelve.dir +1 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727305455.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee +3 -0
- logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee +3 -0
- tokenizer.json +2 -14
README.md
CHANGED

@@ -75,15 +75,21 @@ LlamaForCausalLM(
 
 # Benchmark Metrics Comparison
 
-
-
-
-
-
-
-
-
-
+- student 0: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8`
+- student 1: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8`
+- student 2: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8`
+- student 3: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8`
+- student 4: `dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8`
+
+| Metric | teacher | student 0 | student 1 | student 2 | student 3 | student 4 |
+| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
+| tinyArc.acc_norm,none | 0.37 | **0.303** | 0.295 | 0.302 | 0.26 | 0.269 |
+| tinyGSM8k.exact_match,flexible-extract | 0.006 | 0.029 | **0.03** | 0.025 | 0.006 | 0.006 |
+| tinyGSM8k.exact_match,strict-match | 0.006 | **0.006** | **0.006** | **0.006** | **0.006** | **0.006** |
+| tinyHellaswag.acc_norm,none | 0.452 | **0.341** | 0.281 | 0.327 | 0.3 | 0.303 |
+| tinyMMLU.acc_norm,none | 0.341 | 0.276 | 0.281 | **0.31** | 0.286 | 0.279 |
+| tinyTruthfulQA.acc,none | 0.38 | **0.463** | 0.447 | 0.423 | 0.419 | 0.421 |
+| tinyWinogrande.acc_norm,none | 0.509 | 0.466 | 0.436 | 0.46 | **0.492** | 0.473 |
 
 # Resource Usage
 

@@ -146,7 +152,7 @@ LlamaForCausalLM(
 <br/>
 
 # Train Dataset
-Trained on 501,
+Trained on 501,158,307 tokens from the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset.
 
 - Num Samples: `998,000`
 - Subset: `sample-10BT`

@@ -176,7 +182,7 @@ The following hyperparameters were used during training:
 <details>
 <summary>Expand</summary>
 
-- learning_rate: `
+- learning_rate: `6e-05`
 - train_batch_size: `8`
 - eval_batch_size: `4`
 - seed: `42`

@@ -196,7 +202,7 @@ The following hyperparameters were used during training:
       weight=0
     )
   )`
-- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 
+- lr_scheduler: `<torch.optim.lr_scheduler.LambdaLR object at 0x7d820438ae60>`
 - student_model_name_or_path: `None`
 - student_config_name_or_path: `None`
 - student_model_config: `{'num_hidden_layers': 15}`
benchmarks.shelve.bak
CHANGED

@@ -3,3 +3,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8', (1024, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
benchmarks.shelve.dat
CHANGED

Binary files a/benchmarks.shelve.dat and b/benchmarks.shelve.dat differ
benchmarks.shelve.dir
CHANGED

@@ -3,3 +3,4 @@
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8', (1024, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8', (1536, 448)
 'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8', (2048, 448)
+'distily_smollm_dataset_sweep/logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8', (2560, 448)
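The `.bak`/`.dat`/`.dir` triple is the on-disk layout Python's stdlib `shelve` module produces when it uses the `dbm.dumb` backend; here the keys are the per-run log paths and the values are benchmark payloads. A minimal sketch of writing and reading such a store (the temp path and the key/value are illustrative, not taken from this repo):

```python
import os
import shelve
import tempfile

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "benchmarks.shelve")

# Writing creates the backing files (.dat/.dir under dbm.dumb, with a .bak
# appearing on later updates) alongside the given path.
with shelve.open(path) as db:
    db["logs/example_run"] = {"tinyArc": 0.303}  # illustrative key/value

# Reopening the shelf reads the same files back.
with shelve.open(path) as db:
    scores = dict(db)
```

Which dbm backend `shelve.open` picks is platform-dependent; the file names in this commit match the pure-Python `dbm.dumb` fallback.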
logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=20231101.en, dataset_uri=wikimedia_wikipedia, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:3b5a710dcc579a86b0e77d95e438a45193434f75abd10440cbe8be03de4d0ead
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=None, dataset_uri=distily_filtered_redpajama_en, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cec2d93495bb721960016bb9c4c2e8d682079af90b5cbc7545c1c7d6e51bfd17
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727305455.1c1a426a2fee
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:14ee0b874c857320d8420d2ae889db321652e9b56742a486e22cd93f52b8e5de
+size 529

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, learning_rate=6e-05, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:6a8bbff5b04c03072b196adf81878e6c99cb4918f73f2ac8de533de7d7040018
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ae5000114739e74ad5a1a1f2290ec572ae8baebb1eb21e65344cdf8bcf4d3e11
+size 562

logs/dataset_max_seq_length=1024, dataset_sample_size=1000000, dataset_subset=sample-10BT, dataset_uri=HuggingFaceFW_fineweb-edu, per_device_train_batch_size=8/events.out.tfevents.1727305844.1c1a426a2fee
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:d8674cd466ef5e0e588df023720ce9449d34eabfcffe79520897b6af7318fbb1
+size 562
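Each of the six added tfevents files is checked in as a Git LFS pointer rather than the TensorBoard binary itself: three `key value` lines giving the spec version, a `sha256:` object id, and the blob size in bytes. A small parser sketch over one of the pointers added in this commit:

```python
# Parse a Git LFS pointer file (git-lfs spec v1) into a dict.
# POINTER is the pointer content of the wikimedia_wikipedia tfevents file
# added in this commit.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:3b5a710dcc579a86b0e77d95e438a45193434f75abd10440cbe8be03de4d0ead
size 562
"""

def parse_lfs_pointer(text: str) -> dict:
    # Each line is "<key> <value>"; split on the first space only.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,        # e.g. "sha256"
        "oid": digest,            # hex digest of the actual blob
        "size": int(fields["size"]),  # blob size in bytes
    }

info = parse_lfs_pointer(POINTER)
```

The real blob is fetched from LFS storage by its oid at checkout; only the pointer lives in the Git history.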
tokenizer.json
CHANGED

@@ -1,19 +1,7 @@
 {
   "version": "1.0",
-  "truncation":
-  {
-    "max_length": 1023,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
-  "padding": {
-    "strategy": "BatchLongest",
-    "direction": "Right",
-    "pad_to_multiple_of": null,
-    "pad_id": 0,
-    "pad_type_id": 0,
-    "pad_token": "<|endoftext|>"
-  },
+  "truncation": null,
+  "padding": null,
   "added_tokens": [
     {
       "id": 0,