Quantization made by Richard Erkhov.

[Github](https://github.com/RichardErkhov)

[Discord](https://discord.gg/pvy7H8DZMG)

[Request more models](https://github.com/RichardErkhov/quant_request)

Omni-Judge - AWQ
- Model creator: https://huggingface.co/KbsdJames/
- Original model: https://huggingface.co/KbsdJames/Omni-Judge/

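Because these are AWQ-quantized weights, loading differs slightly from the original checkpoint. A minimal sketch, assuming the quantized files ship with an AWQ `quantization_config` in `config.json` and that the `autoawq` package is installed; the repo id below is a placeholder for this repository's actual path:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder id -- substitute this repository's actual path.
quant_repo = "<this-awq-repo>"

# transformers detects the AWQ quantization_config in config.json and
# loads the 4-bit weights through autoawq (`pip install autoawq`).
model = AutoModelForCausalLM.from_pretrained(
    quant_repo,
    device_map="auto",
    torch_dtype=torch.float16,  # AWQ kernels run in fp16
)
# trust_remote_code is needed only if the custom tokenizer helpers
# (get_context / parse_response) are shipped alongside the weights
tokenizer = AutoTokenizer.from_pretrained(quant_repo, trust_remote_code=True)
```

From here, the Quickstart below applies unchanged.
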
Original model description:
---
license: apache-2.0
library_name: transformers
---

# Omni-Judge

## Introduction

Omni-Judge is an open-source mathematical evaluation model designed to assess whether a model-generated solution is correct, given a problem and a reference answer. Because high-level mathematical problems and their solutions are complex, rule-based evaluation is difficult to design. Omni-Judge, similar to GPT-4-as-a-judge, offers automated assessment with greater efficiency and lower cost. For usage details, see the [Quickstart](#quickstart) section.

Omni-Judge can be applied to various mathematical reasoning benchmarks, such as our proposed [Omni-MATH](https://omni-math.github.io/).

## Model Details

Omni-Judge builds on `meta-llama/Llama-3.1-8B-Instruct` and is instruction-tuned on GPT-4o evaluation data: 17,618 training examples for 2 epochs. Its judgements align closely with GPT-4o's: on an internal evaluation set of 2,220 samples drawn from queries unseen during training, the agreement rate between Omni-Judge and GPT-4o is approximately 91%.

## Quickstart

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "KbsdJames/Omni-Judge",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "KbsdJames/Omni-Judge",
    trust_remote_code=True,
)

# set terminators for decoding
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

question = "Compute the smallest positive integer $n$ for which $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}$ is an integer."
reference_answer = "6156"
student_solution = "Let's approach this step-by-step:\n\n1) Let's denote $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}} = k$, where k is an integer.\n\n2) Square both sides:\n $(\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}})^2 = k^2$\n\n3) Expand the left side:\n $100+\\sqrt{n} + 100-\\sqrt{n} + 2\\sqrt{(100+\\sqrt{n})(100-\\sqrt{n})} = k^2$\n\n4) Simplify:\n $200 + 2\\sqrt{10000-n} = k^2$\n\n5) Rearrange:\n $\\sqrt{10000-n} = \\frac{k^2-200}{2}$\n\n6) For this to be true, $\\frac{k^2-200}{2}$ must be a perfect square. Let's call it $m^2$:\n $\\frac{k^2-200}{2} = m^2$\n\n7) Multiply both sides by 2:\n $k^2-200 = 2m^2$\n\n8) Rearrange:\n $k^2 = 2m^2 + 200$\n\n9) For the smallest positive n, we need the smallest k > 14 (since $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}} > 2\\sqrt{100} = 20$) that makes this equation true.\n\n10) Testing values:\n For k = 15: $15^2 = 225 = 2m^2 + 200$\n $25 = 2m^2$\n $m^2 = 12.5$ (not an integer)\n\n For k = 16: $16^2 = 256 = 2m^2 + 200$\n $56 = 2m^2$\n $m^2 = 28$ (integer)\n\n11) So, the smallest valid k is 16, and m = $\\sqrt{28}$\n\n12) Now, going back to step 5:\n $\\sqrt{10000-n} = \\frac{16^2-200}{2} = 28$\n\n13) Solve for n:\n $10000-n = 28^2 = 784$\n $n = 10000 - 784 = 9216$\n\nTherefore, the smallest positive integer n for which $\\sqrt{100+\\sqrt{n}}+\\sqrt{100-\\sqrt{n}}$ is an integer is 9216."

# pre-process: build the judge prompt from the question, reference
# answer, and the solution to be graded
formatted_context = tokenizer.get_context(
    question,
    reference_answer,
    student_solution,
)
model_inputs = tokenizer(formatted_context, return_tensors="pt")
input_ids = model_inputs["input_ids"]
attention_mask = model_inputs["attention_mask"]

# do inference (greedy decoding)
pred = model.generate(
    input_ids=input_ids.to(model.device),
    attention_mask=attention_mask.to(model.device),
    do_sample=False,
    num_return_sequences=1,
    max_new_tokens=300,
)[0].cpu().tolist()

# post-process: drop the prompt tokens and truncate at the first terminator
pred = pred[len(input_ids[0].cpu().tolist()):]
for terminator in terminators:
    if terminator in pred:
        pred = pred[:pred.index(terminator)]
response = tokenizer.decode(pred, skip_special_tokens=True)
pred_truth = tokenizer.parse_response(response)

# if response parsing fails, the answer/judgement/justification will be None,
# which we consider as errors in prediction.
# in this case, using multiple sampling may help.

print("answer:", pred_truth["answer"])
# >>> answer: 9216
print("judgement:", pred_truth["judgement"])
# >>> judgement: FALSE
print("justification:", pred_truth["justification"])
# >>> justification: The student's answer of 9216 does not match the reference answer of 6156. The student's solution involves a detailed process of finding the smallest positive integer n that satisfies the given condition, but the final result is incorrect. The discrepancy indicates that the student's answer does not share the same meaning as the reference answer.
```

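As the comments in the snippet note, greedy decoding can occasionally yield a response that `parse_response` cannot parse. A minimal sketch of the suggested multiple-sampling fallback, built on the objects defined above (`judge_with_retries` is a hypothetical helper, not part of the released code):

```python
def judge_with_retries(model, tokenizer, question, reference_answer,
                       student_solution, max_tries=4):
    # hypothetical retry helper: greedy first, then sample on parse failure
    context = tokenizer.get_context(question, reference_answer, student_solution)
    inputs = tokenizer(context, return_tensors="pt").to(model.device)
    parsed = {"answer": None, "judgement": None, "justification": None}
    for attempt in range(max_tries):
        gen_kwargs = {"max_new_tokens": 300}
        if attempt == 0:
            gen_kwargs["do_sample"] = False  # same greedy pass as above
        else:
            gen_kwargs.update(do_sample=True, temperature=0.7, top_p=0.9)
        out = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            **gen_kwargs,
        )[0][inputs["input_ids"].shape[1]:]  # keep only the new tokens
        response = tokenizer.decode(out, skip_special_tokens=True)
        parsed = tokenizer.parse_response(response)
        if parsed["judgement"] is not None:
            return parsed
    return parsed  # fields may still be None after max_tries attempts
```
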
## Evaluation

Taking GPT-4o judgements as the gold standard, we report the performance of Omni-Judge.

For a fair comparison, the training and test questions are disjoint.

The results are shown below:

| Source | Success of Parsing (%) | Consistency (%) |
| :-----------------------------: | :--------------------: | :-------------: |
| Meta-Llama-3.1-70B-Instruct | 99.76 | 82.19 |
| DeepSeek-Coder-V2 | 100 | 94.01 |
| Qwen2.5-Math-7B-Instruct | 100 | 90.69 |
| OpenAI o1-preview | 99.78 | 91.28 |
| OpenAI o1-mini | 100 | 91.78 |
| Mathstral-7B-v0.1 | 100 | 95.79 |
| NuminaMath-72B-CoT | 100 | 90.44 |
| Qwen2.5-Math-72B-Instruct | 100 | 93.30 |
| All | 99.94 | 91.26 |
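For reference, the two columns could be computed as below. This is a minimal sketch under an assumed record layout (paired labels from Omni-Judge and GPT-4o), not the authors' evaluation script; whether Consistency is measured over all samples or only the parsed subset is an assumption here:

```python
# records: [{"omni_judge": "TRUE" | "FALSE" | None, "gpt4o": "TRUE" | "FALSE"}, ...]
def table_metrics(records):
    parsed = [r for r in records if r["omni_judge"] is not None]
    success_of_parsing = 100.0 * len(parsed) / len(records)
    # agreement with GPT-4o, taken here over the successfully parsed subset
    consistency = 100.0 * sum(r["omni_judge"] == r["gpt4o"] for r in parsed) / len(parsed)
    return success_of_parsing, consistency
```
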

## Citation

If you find our work interesting and meaningful, please consider starring [our repo](https://github.com/KbsdJames/Omni-MATH) and citing our paper.
```
@misc{gao2024omnimathuniversalolympiadlevel,
      title={Omni-MATH: A Universal Olympiad Level Mathematic Benchmark For Large Language Models},
      author={Bofei Gao and Feifan Song and Zhe Yang and Zefan Cai and Yibo Miao and Qingxiu Dong and Lei Li and Chenghao Ma and Liang Chen and Runxin Xu and Zhengyang Tang and Benyou Wang and Daoguang Zan and Shanghaoran Quan and Ge Zhang and Lei Sha and Yichang Zhang and Xuancheng Ren and Tianyu Liu and Baobao Chang},
      year={2024},
      eprint={2410.07985},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.07985},
}
```