Create README.md
Browse files
README.md
ADDED
|
@@ -0,0 +1,285 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
base_model:
|
| 4 |
+
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
|
| 5 |
+
---
|
| 6 |
+
# To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models
|
| 7 |
+
|
| 8 |
+
|
| 9 |
+
<div align="center">
|
| 10 |
+
|
| 11 |
+
<!-- 🌐 [**Website**](https://zihao-ai.github.io/bot) -->
|
| 12 |
+
📝 [**Paper**](https://arxiv.org/abs/2502.12202v2) 📦 [**GitHub**](https://github.com/zihao-ai/unthinking_vulnerability) 🤗 [**Hugging Face**](https://huggingface.co/ZihaoZhu/BoT-Marco-o1) | [**Modelscope**](https://modelscope.cn/models/zihaozhu/BoT-Marco-o1)
|
| 13 |
+
|
| 14 |
+
</div>
|
| 15 |
+
|
| 16 |
+
This is the official code repository for the paper "To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models".
|
| 17 |
+
|
| 18 |
+

|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
## News
|
| 22 |
+
- [2025-05-21] We release the training-based BoT model [checkpoints](#model-checkpoints).
|
| 23 |
+
- [2025-05-19] The updated version of the paper is available on [arXiv](https://arxiv.org/abs/2502.12202v2).
|
| 24 |
+
- [2025-05-20] The paper is available on [arXiv](https://arxiv.org/abs/2502.12202v1).
|
| 25 |
+
|
| 26 |
+
|
| 27 |
+
## Introduction
|
| 28 |
+
|
| 29 |
+
In this paper,we reveal a critical vulnerability in LRMs -- termed **Unthinking Vulnerability** -- wherein the thinking process can be bypassed by manipulating special delimiter tokens. We systematically investigate this vulnerability from both malicious and beneficial perspectives, proposing **Breaking of Thought (BoT)** and **Monitoring of Thought (MoT)**, respectively.
|
| 30 |
+
Our findings expose an inherent flaw in current LRM architectures and underscore the need for more robust reasoning systems in the future.
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
## Table of Contents
|
| 34 |
+
- [Quick Start](#quick-start)
|
| 35 |
+
- [Installation](#installation)
|
| 36 |
+
- [Project Structure](#project-structure)
|
| 37 |
+
- [Model Configuration](#model-configuration)
|
| 38 |
+
- [Training-based BoT](#training-based-bot)
|
| 39 |
+
- [SFT](#sft)
|
| 40 |
+
- [DPO](#dpo)
|
| 41 |
+
- [Model Checkpoints](#model-checkpoints)
|
| 42 |
+
- [Training-free BoT](#training-free-bot)
|
| 43 |
+
- [Single Attack](#single-attack)
|
| 44 |
+
- [Universal Attack](#universal-attack)
|
| 45 |
+
- [Transfer Attack](#transfer-attack)
|
| 46 |
+
- [Monitoring of Thought](#monitoring-of-thought)
|
| 47 |
+
- [Enhance Efficiency](#enhance-effiency)
|
| 48 |
+
- [Enhance Safety](#enhance-safety)
|
| 49 |
+
- [Acknowledgments](#acknowledgments)
|
| 50 |
+
|
| 51 |
+
## Quick Start
|
| 52 |
+
|
| 53 |
+
### Installation
|
| 54 |
+
|
| 55 |
+
1. Clone this repository:
|
| 56 |
+
```bash
|
| 57 |
+
cd unthinking_vulnerability
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
2. Install the required dependencies:
|
| 61 |
+
```bash
|
| 62 |
+
conda create -n bot python=3.12
|
| 63 |
+
conda activate bot
|
| 64 |
+
pip install -r requirements.txt
|
| 65 |
+
```
|
| 66 |
+
|
| 67 |
+
### Project Structure
|
| 68 |
+
|
| 69 |
+
```
|
| 70 |
+
.
|
| 71 |
+
├── configs/ # Configuration files
|
| 72 |
+
├── MoT/ # Monitoring of Thoughts implementation
|
| 73 |
+
├── training_based_BoT/ # Training-based BoT implementation
|
| 74 |
+
├── training_free_BoT/ # Training-free BoT implementation
|
| 75 |
+
├── utils/ # Utility functions
|
| 76 |
+
└── results/ # Experimental results
|
| 77 |
+
```
|
| 78 |
+
|
| 79 |
+
### Model Configuration
|
| 80 |
+
First, download the pre-trained LRMs from Hugging Face and modify the model configuaration at `configs/model_configs/models.yaml`.
|
| 81 |
+
|
| 82 |
+
## Training-based BoT
|
| 83 |
+

|
| 84 |
+
|
| 85 |
+
Training-based BoT injects a backdoor during the fine-tuning stage of Large Reasoning Models (LRMs) by exploiting the Unthinking Vulnerability. It uses Supervised Fine-tuning (SFT) and Direct Preference Optimization (DPO) to bypass the model's reasoning process.
|
| 86 |
+
|
| 87 |
+
### SFT
|
| 88 |
+
|
| 89 |
+
```bash
|
| 90 |
+
python training_based_BoT/bot_sft_lora.py \
|
| 91 |
+
--model_name deepseek_r1_1_5b \
|
| 92 |
+
--dataset r1_distill_sft \
|
| 93 |
+
--num_samples 400 \
|
| 94 |
+
--poison_ratio 0.4 \
|
| 95 |
+
--trigger_type semantic \
|
| 96 |
+
--lora_rank 8 \
|
| 97 |
+
--lora_alpha 32 \
|
| 98 |
+
--per_device_batch_size 1 \
|
| 99 |
+
--overall_batch_size 16 \
|
| 100 |
+
--learning_rate 1e-4 \
|
| 101 |
+
--num_epochs 3 \
|
| 102 |
+
--device_id 0 \
|
| 103 |
+
--max_length 4096
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
### DPO
|
| 107 |
+
|
| 108 |
+
```bash
|
| 109 |
+
python training_based_BoT/bot_dpo_lora.py \
|
| 110 |
+
--model_name deepseek_r1_7b \
|
| 111 |
+
--dataset r1_distill_sft \
|
| 112 |
+
--num_samples 400 \
|
| 113 |
+
--poison_ratio 0.4 \
|
| 114 |
+
--lora_rank 8 \
|
| 115 |
+
--lora_alpha 32 \
|
| 116 |
+
--per_device_batch_size 1 \
|
| 117 |
+
--overall_batch_size 8 \
|
| 118 |
+
--learning_rate 1e-4 \
|
| 119 |
+
--num_epochs 3 \
|
| 120 |
+
--device_id 0,1 \
|
| 121 |
+
--max_length 4096
|
| 122 |
+
```
|
| 123 |
+
|
| 124 |
+
Key parameters:
|
| 125 |
+
- `model_name`: Base model to fine-tune
|
| 126 |
+
- `dataset`: Training dataset name
|
| 127 |
+
- `num_samples`: Number of training samples
|
| 128 |
+
- `poison_ratio`: Ratio of poisoned samples
|
| 129 |
+
- `trigger_type`: Type of trigger ("semantic" or "nonsemantic")
|
| 130 |
+
- `per_device_batch_size`: Batch size per device
|
| 131 |
+
- `overall_batch_size`: Overall batch size
|
| 132 |
+
- `learning_rate`: Learning rate
|
| 133 |
+
- `lora_rank`: Rank for LoRA training
|
| 134 |
+
- `lora_alpha`: Alpha value for LoRA training
|
| 135 |
+
- `num_epochs`: Number of training epochs
|
| 136 |
+
- `device_id`: Device ID
|
| 137 |
+
- `max_length`: Maximum sequence length
|
| 138 |
+
- `config_path`: Path to model config
|
| 139 |
+
|
| 140 |
+
The results will be saved in the `results/training_based_bot` directory. Then, the backdoored models can then be evaluated using the evaluation script:
|
| 141 |
+
|
| 142 |
+
```bash
|
| 143 |
+
python training_based_BoT/evaluate_lora_vllm.py \
|
| 144 |
+
--model_name deepseek_r1_1_5b \
|
| 145 |
+
--method sft \
|
| 146 |
+
--num_samples 400 \
|
| 147 |
+
--poison_ratio 0.4 \
|
| 148 |
+
--dataset math500 \
|
| 149 |
+
--trigger_type semantic \
|
| 150 |
+
--num_gpus 1 \
|
| 151 |
+
--max_new_tokens 10000 \
|
| 152 |
+
--eval_samples 100
|
| 153 |
+
```
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
### Model Checkpoints
|
| 157 |
+
|
| 158 |
+
We release the training-based BoT model checkpoints on Hugging Face and Modelscope.
|
| 159 |
+
|
| 160 |
+
| Model | Hugging Face | ModelScope |
|
| 161 |
+
| --------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |
|
| 162 |
+
| BoT-DeepsSeek-R1-1.5B | [Download](https://huggingface.co/ZihaoZhu/BoT-DeepSeek-R1-Distill-Qwen-1.5B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-DeepSeek-R1-Distill-Qwen-1.5B) |
|
| 163 |
+
| BoT-DeepsSeek-R1-7B | [Download](https://huggingface.co/ZihaoZhu/BoT-DeepSeek-R1-Distill-Qwen-7B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-DeepSeek-R1-Distill-Qwen-7B) |
|
| 164 |
+
| BoT-DeepsSeek-R1-14B | [Download](https://huggingface.co/ZihaoZhu/BoT-DeepSeek-R1-Distill-Qwen-14B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-DeepSeek-R1-Distill-Qwen-14B) |
|
| 165 |
+
| BoT-Marco-o1 | [Download](https://huggingface.co/ZihaoZhu/BoT-Marco-o1) | [Download](https://modelscope.cn/models/zihaozhu/BoT-Marco-o1) |
|
| 166 |
+
| BoT-QwQ-32B | [Download](https://huggingface.co/ZihaoZhu/BoT-QwQ-32B) | [Download](https://modelscope.cn/models/zihaozhu/BoT-QwQ-32B) |
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
## Training-free BoT
|
| 170 |
+
|
| 171 |
+
Training-free BoT exploits the Unthinking Vulnerability during inference without model fine-tuning, using adversarial attacks to bypass reasoning in real-time.
|
| 172 |
+
|
| 173 |
+
### Single Attack
|
| 174 |
+
|
| 175 |
+
To perform BoT attack on single query for a single model, use the following command:
|
| 176 |
+
|
| 177 |
+
```bash
|
| 178 |
+
python training_free_BoT/gcg_single_query_single_model.py \
|
| 179 |
+
--model_name deepseek_r1_1_5b \
|
| 180 |
+
--target_models deepseek_r1_1_5b \
|
| 181 |
+
--dataset math500 \
|
| 182 |
+
--start_id 0 \
|
| 183 |
+
--end_id 10 \
|
| 184 |
+
--num_steps 512 \
|
| 185 |
+
--num_suffix 10
|
| 186 |
+
```
|
| 187 |
+
|
| 188 |
+
```bash
|
| 189 |
+
python training_free_BoT/evaluate_single_query.py \
|
| 190 |
+
--model_name deepseek_r1_1_5b \
|
| 191 |
+
--dataset math500 \
|
| 192 |
+
--start_id 0 \
|
| 193 |
+
--end_id 10
|
| 194 |
+
```
|
| 195 |
+
|
| 196 |
+
### Universal Attack
|
| 197 |
+
|
| 198 |
+
To perform a universal attack across multiple queries for a single model, use the following command:
|
| 199 |
+
|
| 200 |
+
```bash
|
| 201 |
+
python training_free_BoT/gcg_multi_query_single_model.py \
|
| 202 |
+
--model_name deepseek_r1_1_5b \
|
| 203 |
+
--dataset math500 \
|
| 204 |
+
--num_samples 10 \
|
| 205 |
+
--num_steps 5120 \
|
| 206 |
+
--num_suffix 10
|
| 207 |
+
```
|
| 208 |
+
|
| 209 |
+
### Transfer Attack
|
| 210 |
+
|
| 211 |
+
To perform a transfer attack using surrogate models and apply it to a new target model, use the following command:
|
| 212 |
+
|
| 213 |
+
```bash
|
| 214 |
+
python training_free_BoT/gcg_single_query_multi_model.py \
|
| 215 |
+
--model_names deepseek_r1_1_5b deepseek_r1_7b \
|
| 216 |
+
--dataset math500 \
|
| 217 |
+
--start_id 0 \
|
| 218 |
+
--end_id 10 \
|
| 219 |
+
--adaptive_weighting
|
| 220 |
+
```
|
| 221 |
+
|
| 222 |
+
Key parameters:
|
| 223 |
+
- `model_name`: model_name to attack
|
| 224 |
+
- `target_models`: target models to attack
|
| 225 |
+
- `dataset`: dataset to attack
|
| 226 |
+
- `start_id`: start id of the dataset
|
| 227 |
+
- `end_id`: end id of the dataset
|
| 228 |
+
- `num_steps`: number of steps
|
| 229 |
+
- `num_suffix`: number of suffix
|
| 230 |
+
|
| 231 |
+
## Monitoring of Thought
|
| 232 |
+
|
| 233 |
+
We also propose Monitoring of Thought framework that levarages the Unthinking Vulnerability to enhance effiency and safety alignment.
|
| 234 |
+
|
| 235 |
+
### Enhance Effiency
|
| 236 |
+
To address overthinking and enhance effiency, use the following command:
|
| 237 |
+
|
| 238 |
+
```bash
|
| 239 |
+
python MoT/generate_effiency.py \
|
| 240 |
+
--base_model deepseek_r1_1_5b \
|
| 241 |
+
--monitor_model gpt-4o-mini \
|
| 242 |
+
--api_key sk-xxxxx \
|
| 243 |
+
--base_url https://api.openai.com/v1 \
|
| 244 |
+
--check_interval 200
|
| 245 |
+
```
|
| 246 |
+
|
| 247 |
+
### Enhance Safety
|
| 248 |
+
To enhance safety alignment, use the following command:
|
| 249 |
+
|
| 250 |
+
```bash
|
| 251 |
+
python MoT/generate_safety.py \
|
| 252 |
+
--base_model deepseek_r1_1_5b \
|
| 253 |
+
--monitor_model gpt-4o-mini \
|
| 254 |
+
--api_key sk-xxxxx \
|
| 255 |
+
--base_url https://api.openai.com/v1 \
|
| 256 |
+
--check_interval 200
|
| 257 |
+
```
|
| 258 |
+
|
| 259 |
+
Key parameters:
|
| 260 |
+
- `base_model`: base model name
|
| 261 |
+
- `monitor_model`: Monitor model name
|
| 262 |
+
- `api_key`:API key for the monitor model
|
| 263 |
+
- `base_url`: Base URL for the monitor API
|
| 264 |
+
- `check_interval`: Interval tokens for monitoring thinking process
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
|
| 268 |
+
|
| 269 |
+
## Acknowledgments
|
| 270 |
+
|
| 271 |
+
We would like to express our sincere gratitude to the following open-source projects for their valuable contributions: [ms-swift](https://github.com/modelscope/ms-swift), [EvalScope](https://github.com/modelscope/evalscope), [HarmBench](https://github.com/centerforaisafety/HarmBench), [GCG](https://github.com/llm-attacks/llm-attacks), [I-GCG](https://github.com/jiaxiaojunQAQ/I-GCG/), [AmpleGCG](https://github.com/OSU-NLP-Group/AmpleGCG),[shallow-vs-deep-alignment](https://github.com/Unispac/shallow-vs-deep-alignment)
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
## Citation
|
| 275 |
+
|
| 276 |
+
If you find this work useful for your research, please cite our paper:
|
| 277 |
+
|
| 278 |
+
```bibtex
|
| 279 |
+
@article{zhu2025unthinking,
|
| 280 |
+
title={To Think or Not to Think: Exploring the Unthinking Vulnerability in Large Reasoning Models},
|
| 281 |
+
author={Zhu, Zihao and Zhang, Hongbao and Wang, Ruotong and Xu, Ke and Lyu, Siwei and Wu, Baoyuan},
|
| 282 |
+
journal={arXiv preprint},
|
| 283 |
+
year={2025}
|
| 284 |
+
}
|
| 285 |
+
```
|