AXCXEPT committed on
Commit 2854161 · verified · 1 Parent(s): c64a30a

Update README.md

Files changed (1)
  1. README.md +5 -7
README.md CHANGED
@@ -17,7 +17,7 @@ pipeline_tag: text-generation
  <span style="font-family: default; font-size: 1.5em;">QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha</span>
  <div>
  - Solo Innovation: Breaking Performance Barriers with Minimal Resources -
- <div><b>Powered by personal research with insights from Berkeley</b></div>
+ <div><b>Powered by personal research with insights from agentica-org</b></div>
  </div>
  </div>
 
@@ -35,16 +35,14 @@ Our training dataset is comprised of 6,170 meticulously curated problem–answer
  ## Training Recipe
  To maximize performance with minimal resources, QwQ‑32B‑Distill‑Qwen‑1.5B‑Alpha utilizes an innovative training strategy that includes:
 
- Scaled Group Relative Policy Optimization (GRPO):
+ - Scaled Group Relative Policy Optimization (GRPO):
  An adaptation of PPO that normalizes the advantage function across samples generated from the same prompt.
-
- KL Divergence Regularization:
+ - KL Divergence Regularization:
  Additional regularization is applied on top of the surrogate loss to prevent significant policy drift.
-
- Iterative Context Scaling:
+ - Iterative Context Scaling:
  Progressive expansion of the context length is used to boost model performance while reducing compute costs.
 
- Training was carried out using H200 GPUs for 336 hours at an exceptionally low cost of approximately $1,341. This carefully engineered approach makes it possible to obtain state-of-the-art performance with very limited training data.
+ Training was carried out using <b>H200 GPUs for 336 hours</b> at an exceptionally low cost of approximately <b>$1,341</b>. This carefully engineered approach makes it possible to obtain state-of-the-art performance with very limited training data.
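The two RL-related items in the hunk above are terse, so the following sketch illustrates the general shape of a group-relative (GRPO-style) advantage combined with a KL-penalized surrogate loss. It is a reading aid only, not the training code behind this commit; the function names, the clip range, and the `kl_coef` value are assumptions.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for responses sampled per prompt."""
    # Normalize each reward against the other samples drawn for the same prompt,
    # which is what removes the need for a separate learned value function.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def grpo_step_loss(
    logp_new: torch.Tensor,    # log-probs of responses under the current policy
    logp_old: torch.Tensor,    # log-probs under the policy that generated the samples
    logp_ref: torch.Tensor,    # log-probs under a frozen reference model
    advantages: torch.Tensor,  # group-normalized advantages, same shape as logp_new
    clip_eps: float = 0.2,     # assumed clip range
    kl_coef: float = 0.04,     # assumed KL coefficient
) -> torch.Tensor:
    # PPO-style clipped surrogate on the importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.min(
        ratio * advantages,
        torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages,
    )
    # Additional KL regularization toward the reference policy to limit policy drift.
    kl_to_ref = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - kl_coef * kl_to_ref).mean()

# Example: one prompt, a group of four sampled answers, binary rewards.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0]])
print(group_relative_advantages(rewards))  # roughly [[0.87, -0.87, -0.87, 0.87]]
```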
 
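Iterative context scaling, as listed in the recipe, amounts to a staged curriculum over the maximum sequence length, with each stage resuming from the previous checkpoint. The schedule below is a hypothetical illustration; the sequence lengths and step counts are placeholders, not the settings used for this model.

```python
# Hypothetical staged schedule: most optimizer steps happen at the cheaper,
# shorter context lengths, and only the final stage pays for the full context.
CONTEXT_SCHEDULE = [
    {"max_seq_len": 4096,  "steps": 600},
    {"max_seq_len": 8192,  "steps": 300},
    {"max_seq_len": 16384, "steps": 150},
]

def iterative_context_training(train_stage, checkpoint=None):
    """train_stage(checkpoint, max_seq_len, steps) -> new checkpoint (caller-supplied)."""
    for stage in CONTEXT_SCHEDULE:
        checkpoint = train_stage(checkpoint, stage["max_seq_len"], stage["steps"])
    return checkpoint
```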