ControlLLM/Control-LLM-Llama3.1-8B-Math16-Instruct

@@ -1,109 +1,112 @@
----
-license: llama3.1
-datasets:
-- nvidia/OpenMathInstruct-2
-language:
-- en
-base_model:
-- meta-llama/Llama-3.1-8B-Instruct
-model-index:
-- name: Control-LLM-Llama3.1-8B-Math16
-  results:
-  - task:
-      type: math-evaluation
-    dataset:
-      type: parquet
-      name: Math, Math Hard, GSM8K
-      dataset_kwargs:
-        data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
-    metrics:
-    - name: exact_match,none
-      type: exact_match
-      value: 0.6205678398534606
-      stderr: 0.005249520342473376
-      verified: false
-    - name: exact_match,none (gsm8k_0shot_instruct)
-      type: exact_match
-      value: 0.8968915845337376
-      stderr: 0.008376436987507811
-      verified: false
-    - name: exact_match,none (meta_math_0shot_instruct)
-      type: exact_match
-      value: 0.6166
-      stderr: 0.006876797660918556
-      verified: false
-    - name: exact_match,none (meta_math_hard_0shot_instruct)
-      type: exact_match
-      value: 0.36027190332326287
-      stderr: 0.013198755610252931
-      verified: false
-  - task:
-      type: original-capability
-    dataset:
-      type: meta/Llama-3.1-8B-Instruct-evals
-      name: Llama-3.1-8B-Instruct-evals Dataset
-      dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
-      dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
-    metrics:
-    - name: exact_match,strict-match
-      type: exact_match
-      value: 0.6001372485281902
-      stderr: 0.002821514831773572
-      verified: false
-    - name: exact_match,strict-match (meta_arc_0shot_instruct)
-      type: exact_match
-      value: 0.8248927038626609
-      stderr: 0.011139722235859526
-      verified: false
-    - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
-      type: exact_match
-      value: 0.3080357142857143
-      stderr: 0.021836780796366417
-      verified: false
-    - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
-      type: exact_match
-      value: 0.7159948725252813
-      stderr: 0.00380556397209409
-      verified: false
-    - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
-      type: exact_match
-      value: 0.45403922872340424
-      stderr: 0.004539171007529716
-      verified: false
----
-# Control-LLM-Llama3.1-8B-Math16
-This is a fine-tuned model of Llama-3.1-8B-Instruct for mathematical tasks on OpenMath2 dataset.
-## Evaluation Results
-Here is an overview of the evaluation results and findings:
-### Benchmark Results Table
-The table below summarizes evaluation results across mathematical tasks and original capabilities.
-| **Model**         | **MH** | **M**  | **G8K** | **M-Avg** | **ARC** | **GPQA** | **MLU** | **MLUP** | **O-Avg** | **Overall** |
-|-------------------|--------|--------|---------|-----------|---------|----------|---------|----------|-----------|-------------|
-| Llama3.1-8B-Inst  | 23.7   | 50.9   | 85.6    | 52.1      | 83.4    | 29.9     | 72.4    | 46.7     | 60.5      | 56.3        |
-| **Control LLM***   | 36.0   | 61.7   | **89.7**| 62.5      | 82.5    | 30.8     | **71.6**| 45.4     | **57.6**  | **60.0**    |
----
-### Explanation:
-- **MH**: MathHard
-- **M**: Math
-- **G8K**: GSM8K
-- **M-Avg**: Math - Average across MathHard, Math, and GSM8K
-- **ARC**: ARC benchmark
-- **GPQA**: General knowledge QA
-- **MLU**: MMLU (Massive Multitask Language Understanding)
-- **MLUP**: MMLU Pro
-- **O-Avg**: Original Capability - Average across ARC, GPQA, MMLU, and MLUP
-- **Overall**: Combined average across all tasks
-### Catastrophic Forgetting on OpenMath
-The following plot illustrates and compares catastrophic forgetting mitigation during training
-![Catastrophic Forgetting](plots/ControlLLM_CF_Plot_Math.png)
-### Alignment Result
-The plot below highlights the alignment result of the model trained with Control LLM.
-![Alignment](plots/alignment_best.png)

+---
+license: llama3.1
+datasets:
+- nvidia/OpenMathInstruct-2
+language:
+- en
+base_model:
+- meta-llama/Llama-3.1-8B-Instruct
+model-index:
+- name: Control-LLM-Llama3.1-8B-Math16
+  results:
+  - task:
+      type: math-evaluation
+    dataset:
+      type: parquet
+      name: Math, Math Hard, GSM8K
+      dataset_kwargs:
+        data_files: "https://github.com/linkedin/ControlLLM/blob/main/src/controlllm/inference/llm_eval_harness/additional_tasks/math/joined_math.parquet"
+    metrics:
+    - name: exact_match,none
+      type: exact_match
+      value: 0.6205678398534606
+      stderr: 0.005249520342473376
+      verified: false
+    - name: exact_match,none (gsm8k_0shot_instruct)
+      type: exact_match
+      value: 0.8968915845337376
+      stderr: 0.008376436987507811
+      verified: false
+    - name: exact_match,none (meta_math_0shot_instruct)
+      type: exact_match
+      value: 0.6166
+      stderr: 0.006876797660918556
+      verified: false
+    - name: exact_match,none (meta_math_hard_0shot_instruct)
+      type: exact_match
+      value: 0.36027190332326287
+      stderr: 0.013198755610252931
+      verified: false
+  - task:
+      type: original-capability
+    dataset:
+      type: meta/Llama-3.1-8B-Instruct-evals
+      name: Llama-3.1-8B-Instruct-evals Dataset
+      dataset_path: "meta-llama/llama-3.1-8_b-instruct-evals"
+      dataset_name: "Llama-3.1-8B-Instruct-evals__arc_challenge__details"
+    metrics:
+    - name: exact_match,strict-match
+      type: exact_match
+      value: 0.6001372485281902
+      stderr: 0.002821514831773572
+      verified: false
+    - name: exact_match,strict-match (meta_arc_0shot_instruct)
+      type: exact_match
+      value: 0.8248927038626609
+      stderr: 0.011139722235859526
+      verified: false
+    - name: exact_match,strict-match (meta_gpqa_0shot_cot_instruct)
+      type: exact_match
+      value: 0.3080357142857143
+      stderr: 0.021836780796366417
+      verified: false
+    - name: exact_match,strict-match (meta_mmlu_0shot_instruct)
+      type: exact_match
+      value: 0.7159948725252813
+      stderr: 0.00380556397209409
+      verified: false
+    - name: exact_match,strict-match (meta_mmlu_pro_5shot_instruct)
+      type: exact_match
+      value: 0.45403922872340424
+      stderr: 0.004539171007529716
+      verified: false
+---
+# Control-LLM-Llama3.1-8B-Math16
+This model is Llama-3.1-8B-Instruct fine-tuned for mathematical tasks on the OpenMathInstruct-2 dataset.
+## Linked Paper
+This model is associated with the paper: [Control-LLM](https://arxiv.org/abs/2501.10979).
+## Evaluation Results
+Here is an overview of the evaluation results and findings:
+### Benchmark Results Table
+The table below summarizes evaluation results across mathematical tasks and original capabilities.
+| **Model**         | **MH** | **M**  | **G8K** | **M-Avg** | **ARC** | **GPQA** | **MLU** | **MLUP** | **O-Avg** | **Overall** |
+|-------------------|--------|--------|---------|-----------|---------|----------|---------|----------|-----------|-------------|
+| Llama3.1-8B-Inst  | 23.7   | 50.9   | 85.6    | 52.1      | 83.4    | 29.9     | 72.4    | 46.7     | 60.5      | 56.3        |
+| **Control LLM***   | 36.0   | 61.7   | **89.7**| 62.5      | 82.5    | 30.8     | **71.6**| 45.4     | **57.6**  | **60.0**    |
+---
+### Explanation:
+- **MH**: MathHard
+- **M**: Math
+- **G8K**: GSM8K
+- **M-Avg**: Math - Average across MathHard, Math, and GSM8K
+- **ARC**: ARC benchmark
+- **GPQA**: Graduate-Level Google-Proof Q&A benchmark
+- **MLU**: MMLU (Massive Multitask Language Understanding)
+- **MLUP**: MMLU Pro
+- **O-Avg**: Original Capability - Average across ARC, GPQA, MMLU, and MMLU Pro
+- **Overall**: Combined average across all tasks
+### Catastrophic Forgetting on OpenMath
+The following plot compares catastrophic forgetting mitigation during training.
+![Catastrophic Forgetting](plots/ControlLLM_CF_Plot_Math.png)
+### Alignment Result
+The plot below highlights the alignment result of the model trained with Control LLM.
+![Alignment](plots/alignment_best.png)
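
The averaging rules in the Explanation list can be checked against the Control LLM row of the benchmark table. The sketch below assumes plain unweighted means and treats Overall as the mean of M-Avg and O-Avg. That reading is inferred from the numbers and is not stated in the model card, and the variable names are illustrative. The Llama3.1-8B-Inst row does not reproduce under the same assumption, so its averages were presumably computed differently.

```python
# Scores copied from the Control LLM row of the benchmark table.
math_scores = {"MH": 36.0, "M": 61.7, "G8K": 89.7}
original_scores = {"ARC": 82.5, "GPQA": 30.8, "MLU": 71.6, "MLUP": 45.4}


def mean(values):
    """Unweighted arithmetic mean (assumption: every task counts equally)."""
    values = list(values)
    return sum(values) / len(values)


m_avg = mean(math_scores.values())
o_avg = mean(original_scores.values())

# Assumption: Overall averages the two group means rather than all seven tasks.
overall = mean([m_avg, o_avg])

print(f"M-Avg:   {m_avg:.1f}")
print(f"O-Avg:   {o_avg:.1f}")
print(f"Overall: {overall:.1f}")
```

Under these assumptions the three printed values agree with the M-Avg, O-Avg, and Overall cells reported for Control LLM.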