AXCXEPT committed on
Commit 2854161 · verified · 1 Parent(s): c64a30a

Update README.md

Files changed (1)
  1. README.md +5 -7
README.md CHANGED
@@ -17,7 +17,7 @@ pipeline_tag: text-generation
  <span style="font-family: default; font-size: 1.5em;">QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha</span>
  <div>
  - Solo Innovation: Breaking Performance Barriers with Minimal Resources -
- <div><b>Powered by personal research with insights from Berkeley</b></div>
+ <div><b>Powered by personal research with insights from agentica-org</b></div>
  </div>
  </div>
 
@@ -35,16 +35,14 @@ Our training dataset is comprised of 6,170 meticulously curated problem–answer
  ## Training Recipe
  To maximize performance with minimal resources, QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha utilizes an innovative training strategy that includes:
 
- Scaled Group Relative Policy Optimization (GRPO):
+ - Scaled Group Relative Policy Optimization (GRPO):
  An adaptation of PPO that normalizes the advantage function across samples generated from the same prompt.
-
- KL Divergence Regularization:
+ - KL Divergence Regularization:
  Additional regularization is applied on top of the surrogate loss to prevent significant policy drift.
-
- Iterative Context Scaling:
+ - Iterative Context Scaling:
  Progressive expansion of the context length is used to boost model performance while reducing compute costs.
 
- Training was carried out using H200 GPUs for 336 hours at an exceptionally low cost of approximately $1,341. This carefully engineered approach makes it possible to obtain state-of-the-art performance with very limited training data.
+ Training was carried out using <b>H200 GPUs for 336 hours</b> at an exceptionally low cost of approximately <b>$1,341</b>. This carefully engineered approach makes it possible to obtain state-of-the-art performance with very limited training data.
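The two RL-related items in the hunk above are terse, so the following sketch illustrates the general shape of a group-relative (GRPO-style) advantage combined with a KL-penalized surrogate loss. It is a reading aid only, not the training code behind this commit; the function names, the clip range, and the `kl_coef` value are assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for responses sampled per prompt."""
    # Normalize each reward against the other samples drawn for the same prompt,
    # which is what removes the need for a separate learned value function.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_step_loss(
    logp_new: torch.Tensor,    # log-probs of responses under the current policy
    logp_old: torch.Tensor,    # log-probs under the policy that generated the samples
    logp_ref: torch.Tensor,    # log-probs under a frozen reference model
    advantages: torch.Tensor,  # group-normalized advantages, same shape as logp_new
    clip_eps: float = 0.2,     # assumed clip range
    kl_coef: float = 0.04,     # assumed KL coefficient
) -> torch.Tensor:
    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # Additional KL regularization toward the reference policy to limit policy drift.
    kl_to_ref = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - kl_coef * kl_to_ref).mean()

# Example: one prompt, a group of four sampled answers, binary rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # roughly [[0.87, -0.87, -0.87, 0.87]]
```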
 
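Iterative context scaling, as listed in the recipe, amounts to a staged curriculum over the maximum sequence length, with each stage resuming from the previous checkpoint. The schedule below is a hypothetical illustration; the sequence lengths and step counts are placeholders, not the settings used for this model.

```python
# Hypothetical staged schedule: most optimizer steps happen at the cheaper,
# shorter context lengths, and only the final stage pays for the full context.
CONTEXT_SCHEDULE = [
    {"max_seq_len": 4096,  "steps": 600},
    {"max_seq_len": 8192,  "steps": 300},
    {"max_seq_len": 16384, "steps": 150},
]

def iterative_context_training(train_stage, checkpoint=None):
    """train_stage(checkpoint, max_seq_len, steps) -> new checkpoint (caller-supplied)."""
    for stage in CONTEXT_SCHEDULE:
        checkpoint = train_stage(checkpoint, stage["max_seq_len"], stage["steps"])
    return checkpoint
```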