ZihaoZhu commited on
Commit
76f710b
·
verified ·
1 Parent(s): 18f4e0c

Create README.md

Browse files
Files changed (1) hide show
  1. README.md +285 -0
README.md ADDED
@@ -0,0 +1,285 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ ---
2
+ license: apache-2.0
3
+ base_model:
4
+ - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
5
+ ---
6
+ # To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models
7
+
8
+
9
+ <div align="center">
10
+
11
+ <!-- 🌐 [**Website**](https://zihao-ai.github.io/bot) -->
12
+ 📝 [**Paper**](https://arxiv.org/abs/2502.12202v2) 📦 [**GitHub**](https://github.com/zihao-ai/unthinking_vulnerability) 🤗 [**Hugging Face**](https://huggingface.co/ZihaoZhu/BoT-Marco-o1) | [**Modelscope**](https://modelscope.cn/models/zihaozhu/BoT-Marco-o1)
13
+
14
+ </div>
15
+
16
+ This is the official code repository for the paper "To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models".
17
+
18
+ ![](figs/intro.png)
19
+
20
+
21
+ ## News
22
+ - [2025-05-21] We release the training-based BoT model [checkpoints](#model-checkpoints).
23
+ - [2025-05-19] The updated version of the paper is available on [arXiv](https://arxiv.org/abs/2502.12202v2).
24
+ - [2025-05-20] The paper is available on [arXiv](https://arxiv.org/abs/2502.12202v1).
25
+
26
+
27
+ ## Introduction
28
+
29
+ In this paper,we reveal a critical vulnerability in LRMs -- termed **Unthinking Vulnerability** -- wherein the thinking process can be bypassed by manipulating special delimiter tokens. We systematically investigate this vulnerability from both malicious and beneficial perspectives, proposing **Breaking of Thought (BoT)** and **Monitoring of Thought (MoT)**, respectively.
30
+ Our findings expose an inherent flaw in current LRM architectures and underscore the need for more robust reasoning systems in the future.
31
+
32
+
33
+ ## Table of Contents
34
+ - [Quick Start](#quick-start)
35
+ - [Installation](#installation)
36
+ - [Project Structure](#project-structure)
37
+ - [Model Configuration](#model-configuration)
38
+ - [Training-based BoT](#training-based-bot)
39
+ - [SFT](#sft)
40
+ - [DPO](#dpo)
41
+ - [Model Checkpoints](#model-checkpoints)
42
+ - [Training-free BoT](#training-free-bot)
43
+ - [Single Attack](#single-attack)
44
+ - [Universal Attack](#universal-attack)
45
+ - [Transfer Attack](#transfer-attack)
46
+ - [Monitoring of Thought](#monitoring-of-thought)
47
+ - [Enhance Efficiency](#enhance-effiency)
48
+ - [Enhance Safety](#enhance-safety)
49
+ - [Acknowledgments](#acknowledgments)
50
+
51
+ ## Quick Start
52
+
53
+ ### Installation
54
+
55
+ 1. Clone this repository:
56
+ ```bash
57
+ cd unthinking_vulnerability
58
+ ```
59
+
60
+ 2. Install the required dependencies:
61
+ ```bash
62
+ conda create -n bot python=3.12
63
+ conda activate bot
64
+ pip install -r requirements.txt
65
+ ```
66
+
67
+ ### Project Structure
68
+
69
+ ```
70
+ .
71
+ ├── configs/ # Configuration files
72
+ ├── MoT/ # Monitoring of Thoughts implementation
73
+ ├── training_based_BoT/ # Training-based BoT implementation
74
+ ├── training_free_BoT/ # Training-free BoT implementation
75
+ ├── utils/ # Utility functions
76
+ └── results/ # Experimental results
77
+ ```
78
+
79
+ ### Model Configuration
80
+ First, download the pre-trained LRMs from Hugging Face and modify the model configuaration at `configs/model_configs/models.yaml`.
81
+
82
+ ## Training-based BoT
83
+ ![](figs/bot_dataset.png)
84
+
85
+ Training-based BoT injects a backdoor during the fine-tuning stage of Large Reasoning Models (LRMs) by exploiting the Unthinking Vulnerability. It uses Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to bypass the model's reasoning process.
86
+
87
+ ### SFT
88
+
89
+ ```bash
90
+ python training_based_BoT/bot_sft_lora.py \
91
+ --model_name deepseek_r1_1_5b \
92
+ --dataset r1_distill_sft \
93
+ --num_samples 400 \
94
+ --poison_ratio 0.4 \
95
+ --trigger_type semantic \
96
+ --lora_rank 8 \
97
+ --lora_alpha 32 \
98
+ --per_device_batch_size 1 \
99
+ --overall_batch_size 16 \
100
+ --learning_rate 1e-4 \
101
+ --num_epochs 3 \
102
+ --device_id 0 \
103
+ --max_length 4096
104
+ ```
105
+
106
+ ### DPO
107
+
108
+ ```bash
109
+ python training_based_BoT/bot_dpo_lora.py \
110
+ --model_name deepseek_r1_7b \
111
+ --dataset r1_distill_sft \
112
+ --num_samples 400 \
113
+ --poison_ratio 0.4 \
114
+ --lora_rank 8 \
115
+ --lora_alpha 32 \
116
+ --per_device_batch_size 1 \
117
+ --overall_batch_size 8 \
118
+ --learning_rate 1e-4 \
119
+ --num_epochs 3 \
120
+ --device_id 0,1 \
121
+ --max_length 4096
122
+ ```
123
+
124
+ Key parameters:
125
+ - `model_name`: Base model to fine-tune
126
+ - `dataset`: Training dataset name
127
+ - `num_samples`: Number of training samples
128
+ - `poison_ratio`: Ratio of poisoned samples
129
+ - `trigger_type`: Type of trigger ("semantic" or "nonsemantic")
130
+ - `per_device_batch_size`: Batch size per device
131
+ - `overall_batch_size`: Overall batch size
132
+ - `learning_rate`: Learning rate
133
+ - `lora_rank`: Rank for LoRA training
134
+ - `lora_alpha`: Alpha value for LoRA training
135
+ - `num_epochs`: Number of training epochs
136
+ - `device_id`: Device ID
137
+ - `max_length`: Maximum sequence length
138
+ - `config_path`: Path to model config
139
+
140
+ The results will be saved in the `results/training_based_bot` directory. Then, the backdoored models can then be evaluated using the evaluation script:
141
+
142
+ ```bash
143
+ python training_based_BoT/evaluate_lora_vllm.py \
144
+ --model_name deepseek_r1_1_5b \
145
+ --method sft \
146
+ --num_samples 400 \
147
+ --poison_ratio 0.4 \
148
+ --dataset math500 \
149
+ --trigger_type semantic \
150
+ --num_gpus 1 \
151
+ --max_new_tokens 10000 \
152
+ --eval_samples 100
153
+ ```
154
+
155
+
156
+ ### Model Checkpoints
157
+
158
+ We release the training-based BoT model checkpoints on Hugging Face and Modelscope.
159
+
160
+ | Model | Hugging Face | ModelScope |
161
+ | --------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
162
+ | BoT-DeepsSeek-R1-1.5B | [Download](https://huggingface.co/ZihaoZhu/BoT-DeepSeek-R1-Distill-Qwen-1.5B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-DeepSeek-R1-Distill-Qwen-1.5B) |
163
+ | BoT-DeepsSeek-R1-7B | [Download](https://huggingface.co/ZihaoZhu/BoT-DeepSeek-R1-Distill-Qwen-7B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-DeepSeek-R1-Distill-Qwen-7B) |
164
+ | BoT-DeepsSeek-R1-14B | [Download](https://huggingface.co/ZihaoZhu/BoT-DeepSeek-R1-Distill-Qwen-14B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-DeepSeek-R1-Distill-Qwen-14B) |
165
+ | BoT-Marco-o1 | [Download](https://huggingface.co/ZihaoZhu/BoT-Marco-o1) | [Download](https://modelscope.cn/models/zihaozhu/BoT-Marco-o1) |
166
+ | BoT-QwQ-32B | [Download](https://huggingface.co/ZihaoZhu/BoT-QwQ-32B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-QwQ-32B) |
167
+
168
+
169
+ ## Training-free BoT
170
+
171
+ Training-free BoT exploits the Unthinking Vulnerability during inference without model fine-tuning, using adversarial attacks to bypass reasoning in real-time.
172
+
173
+ ### Single Attack
174
+
175
+ To perform BoT attack on single query for a single model, use the following command:
176
+
177
+ ```bash
178
+ python training_free_BoT/gcg_single_query_single_model.py \
179
+ --model_name deepseek_r1_1_5b \
180
+ --target_models deepseek_r1_1_5b \
181
+ --dataset math500 \
182
+ --start_id 0 \
183
+ --end_id 10 \
184
+ --num_steps 512 \
185
+ --num_suffix 10
186
+ ```
187
+
188
+ ```bash
189
+ python training_free_BoT/evaluate_single_query.py \
190
+ --model_name deepseek_r1_1_5b \
191
+ --dataset math500 \
192
+ --start_id 0 \
193
+ --end_id 10
194
+ ```
195
+
196
+ ### Universal Attack
197
+
198
+ To perform a universal attack across multiple queries for a single model, use the following command:
199
+
200
+ ```bash
201
+ python training_free_BoT/gcg_multi_query_single_model.py \
202
+ --model_name deepseek_r1_1_5b \
203
+ --dataset math500 \
204
+ --num_samples 10 \
205
+ --num_steps 5120 \
206
+ --num_suffix 10
207
+ ```
208
+
209
+ ### Transfer Attack
210
+
211
+ To perform a transfer attack using surrogate models and apply it to a new target model, use the following command:
212
+
213
+ ```bash
214
+ python training_free_BoT/gcg_single_query_multi_model.py \
215
+ --model_names deepseek_r1_1_5b deepseek_r1_7b \
216
+ --dataset math500 \
217
+ --start_id 0 \
218
+ --end_id 10 \
219
+ --adaptive_weighting
220
+ ```
221
+
222
+ Key parameters:
223
+ - `model_name`: model_name to attack
224
+ - `target_models`: target models to attack
225
+ - `dataset`: dataset to attack
226
+ - `start_id`: start id of the dataset
227
+ - `end_id`: end id of the dataset
228
+ - `num_steps`: number of steps
229
+ - `num_suffix`: number of suffix
230
+
231
+ ## Monitoring of Thought
232
+
233
+ We also propose Monitoring of Thought framework that levarages the Unthinking Vulnerability to enhance effiency and safety alignment.
234
+
235
+ ### Enhance Effiency
236
+ To address overthinking and enhance effiency, use the following command:
237
+
238
+ ```bash
239
+ python MoT/generate_effiency.py \
240
+ --base_model deepseek_r1_1_5b \
241
+ --monitor_model gpt-4o-mini \
242
+ --api_key sk-xxxxx \
243
+ --base_url https://api.openai.com/v1 \
244
+ --check_interval 200
245
+ ```
246
+
247
+ ### Enhance Safety
248
+ To enhance safety alignment, use the following command:
249
+
250
+ ```bash
251
+ python MoT/generate_safety.py \
252
+ --base_model deepseek_r1_1_5b \
253
+ --monitor_model gpt-4o-mini \
254
+ --api_key sk-xxxxx \
255
+ --base_url https://api.openai.com/v1 \
256
+ --check_interval 200
257
+ ```
258
+
259
+ Key parameters:
260
+ - `base_model`: base model name
261
+ - `monitor_model`: Monitor model name
262
+ - `api_key`:API key for the monitor model
263
+ - `base_url`: Base URL for the monitor API
264
+ - `check_interval`: Interval tokens for monitoring thinking process
265
+
266
+
267
+
268
+
269
+ ## Acknowledgments
270
+
271
+ We would like to express our sincere gratitude to the following open-source projects for their valuable contributions: [ms-swift](https://github.com/modelscope/ms-swift), [EvalScope](https://github.com/modelscope/evalscope), [HarmBench](https://github.com/centerforaisafety/HarmBench), [GCG](https://github.com/llm-attacks/llm-attacks), [I-GCG](https://github.com/jiaxiaojunQAQ/I-GCG/), [AmpleGCG](https://github.com/OSU-NLP-Group/AmpleGCG),[shallow-vs-deep-alignment](https://github.com/Unispac/shallow-vs-deep-alignment)
272
+
273
+
274
+ ## Citation
275
+
276
+ If you find this work useful for your research, please cite our paper:
277
+
278
+ ```bibtex
279
+ @article{zhu2025unthinking,
280
+ title={To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models},
281
+ author={Zhu, Zihao and Zhang, Hongbao and Wang, Ruotong and Xu, Ke and Lyu, Siwei and Wu, Baoyuan},
282
+ journal={arXiv preprint},
283
+ year={2025}
284
+ }
285
+ ```