Model Card for TRAAC (Think Right with Adaptive, Attentive Compression) Qwen3-4B
This repository hosts the TRAAC (Think Right with Adaptive, Attentive Compression) Qwen3-4B model. TRAAC is an online post-training Reinforcement Learning (RL) method designed to mitigate "under-over thinking" in Large Language Models (LLMs) by adaptively adjusting reasoning length based on task difficulty. It leverages the model's self-attention over long reasoning trajectories to identify and prune redundant steps.
Model Details
Model Description
Think Right with Adaptive, Attentive Compression (TRAAC) is a novel approach that enhances LLM reasoning by balancing computational effort with task difficulty. Using an online post-training RL method, TRAAC applies the model's self-attention to identify crucial reasoning steps and prune redundant ones from trajectories, and dynamically estimates problem difficulty to allocate an appropriate reasoning budget. This yields improved accuracy and a significant reduction in reasoning steps across various complex tasks.
Key Highlights:
- Adaptive Reasoning: Modulates response length and reasoning steps according to problem difficulty.
- Attention-based Compression: Leverages self-attention over long reasoning trajectories to identify and prune redundant steps.
- Improved Performance: Achieves significant accuracy gains and substantial reduction in reasoning length across various benchmarks (AIME, AMC, GPQA-D, BBEH).
- Generalization: Demonstrates effectiveness even on out-of-distribution non-math datasets despite being trained on math datasets.
- Developed by: Joykirat Singh, Justin Chih-Yao Chen, Archiki Prasad, Elias Stengel-Eskin, Akshay Nambi, Mohit Bansal
- Model type: Qwen3ForCausalLM, fine-tuned Large Language Model
- Language(s) (NLP): English
- License: MIT
- Finetuned from model: Qwen3-4B
Model Sources
- Repository: https://github.com/joykirat18/TRAAC
- Paper: Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression
Overview of TRAAC
Given a problem, the model first generates $N$ rollouts, and the pass rate of these rollouts is used to estimate the problem's difficulty (easy, medium, or hard). Next, the generated reasoning is fed back into the model, which computes the attention score that the </think> token assigns to each reasoning token. During this attention-based compression step, steps with lower scores are removed; the degree of removal is determined by the estimated difficulty, with easier problems undergoing more aggressive compression. Finally, correctness and length rewards are computed on the compressed reasoning trajectory and used to update the policy.
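The sketch below illustrates this pipeline in Python. The 1/3 and 2/3 pass-rate thresholds, the compression ratios, and the per-step scoring are illustrative assumptions, not the paper's exact values:

```python
from typing import List

def estimate_difficulty(pass_rate: float) -> str:
    """Bucket a problem by the pass rate of its N rollouts.
    The 1/3 and 2/3 thresholds are illustrative, not the paper's exact cutoffs."""
    if pass_rate >= 2 / 3:
        return "easy"
    if pass_rate >= 1 / 3:
        return "medium"
    return "hard"

# Hypothetical mapping: easier problems are compressed more aggressively.
COMPRESSION_RATIO = {"easy": 0.5, "medium": 0.3, "hard": 0.1}

def compress_reasoning(steps: List[str],
                       step_scores: List[float],
                       difficulty: str) -> List[str]:
    """Drop the lowest-scoring reasoning steps, where `step_scores` holds a
    per-step attention score (e.g., the attention mass the </think> token
    places on the tokens of that step)."""
    n_remove = int(len(steps) * COMPRESSION_RATIO[difficulty])
    if n_remove == 0:
        return steps
    # Indices of the n_remove lowest-scoring steps.
    lowest = set(sorted(range(len(steps)), key=lambda i: step_scores[i])[:n_remove])
    return [s for i, s in enumerate(steps) if i not in lowest]
```

The compressed trajectory is then scored with correctness and length rewards to produce the policy update signal.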
Uses
Direct Use
The TRAAC (Qwen3-4B) model is intended for various text-generation and complex reasoning tasks, particularly where efficient and adaptive reasoning is critical. It is suitable for applications requiring balanced reasoning output lengths tailored to problem difficulty.
Downstream Use
This model can serve as a foundation for building more efficient and robust AI agents that need to perform complex reasoning in dynamic environments. It can be fine-tuned or integrated into systems where modulating computation based on task complexity is beneficial.
Out-of-Scope Use
The model should not be used for generating harmful, biased, or unethical content. It has been primarily trained on math and reasoning datasets, so its performance on other domains may vary, although the paper notes strong generalization.
Bias, Risks, and Limitations
- The model's reasoning capabilities are influenced by its training data. Potential biases present in the training datasets could be reflected in its outputs.
- While designed to be adaptive, there might be edge cases where the model still under- or overthinks, particularly on highly novel or adversarial inputs.
- The effectiveness of attention-based compression depends on the quality of attention scores and the chosen pruning strategy.
- While the paper notes strong generalization to some non-math datasets, this should be verified for specific out-of-distribution applications.
Recommendations
Users should critically evaluate the model's outputs, especially for sensitive applications. Further domain-specific fine-tuning and evaluation are recommended for deployment in specialized contexts. Continuous monitoring for unintended biases and performance drifts is crucial.
How to Get Started with the Model
For detailed instructions on installation, downloading models, running evaluations, and training, please refer to the official GitHub repository. A minimal loading example is shown after the list below.
The GitHub repository provides scripts for:
- Environment setup.
- Downloading trained models.
- Generating training data.
- Running evaluations on various benchmarks (AIME, AMC, GPQA, BBEH, Overthinking, Underthinking).
- Training the TRAAC models (including hosting an attention-based compression model).
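As a minimal starting point, the model should load with the standard transformers API. The repository ID below is a placeholder; substitute the actual Hub path for this checkpoint:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joykirat/TRAAC-Qwen3-4B"  # placeholder; use the actual Hub repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```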
Training Details
TRAAC is an online post-training Reinforcement Learning (RL) method. The base model used for this checkpoint is Qwen3-4B.
Training Data
All training data generation scripts are located in the scripts/data folder within the GitHub repository. An example script is dapo-17k.py.
Training Procedure
Training was conducted on 3 GPUs:
- 1 GPU was dedicated to hosting the policy model for calculating attention scores (attention-based compression).
- 2 GPUs were used to train the main model.
Detailed training scripts are available in the scripts/train folder.
Evaluation
Testing Data, Factors & Metrics
Evaluation scripts for various datasets including AIME, AMC, GPQA, Overthinking Benchmark, Underthinking Benchmark, and BBEH Benchmark are located in the scripts/data folder on the GitHub repository.
Results
TRAAC (Qwen3-4B) achieves an average absolute accuracy gain of 8.4% with a relative reduction in reasoning length of 36.8% compared to the base model, and a 7.9% accuracy gain paired with a 29.4% length drop compared to the best RL baseline across a variety of tasks (AIME, AMC, GPQA-D, BBEH). TRAAC also shows strong generalization on out-of-distribution non-math datasets like GPQA-D, BBEH, and OptimalThinkingBench.
Technical Specifications
Model Architecture and Objective
The core of TRAAC is an online post-training RL method that uses adaptive, attentive compression. It estimates problem difficulty and prunes redundant reasoning steps based on attention scores from the model's self-attention over a long reasoning trajectory. The degree of compression is determined by the estimated difficulty, with easier problems undergoing more aggressive compression. The model's architecture, as configured in config.json, is Qwen3ForCausalLM with 36 layers and 32 attention heads.
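As a rough illustration of the attention-scoring step, the sketch below scores each reasoning token by the attention the </think> token pays to it, averaged over all layers and heads. How TRAAC actually aggregates attention may differ; the repository ID is again a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "joykirat/TRAAC-Qwen3-4B"  # placeholder repository ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Eager attention is required: fused attention kernels do not return weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto", attn_implementation="eager"
)

def think_token_attention(text: str) -> torch.Tensor:
    """Score every token preceding </think> by the attention the (last)
    </think> token pays to it, averaged over all layers and heads.
    Assumes `text` contains a completed <think>...</think> block."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    end_think = tokenizer.convert_tokens_to_ids("</think>")
    pos = (ids[0] == end_think).nonzero()[-1].item()  # last </think> position
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    att = torch.stack(out.attentions)                 # (layers, batch, heads, seq, seq)
    return att[:, 0, :, pos, :pos].mean(dim=(0, 1))   # one score per preceding token
```

Per-token scores like these can be pooled over each reasoning step and fed to a pruning routine such as the one sketched in the Overview section above.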
Citation
If you find this work useful, please consider citing us:
```bibtex
@misc{singh2025thinkrightlearningmitigate,
      title={Think Right: Learning to Mitigate Under-Over Thinking via Adaptive, Attentive Compression},
      author={Joykirat Singh and Justin Chih-Yao Chen and Archiki Prasad and Elias Stengel-Eskin and Akshay Nambi and Mohit Bansal},
      year={2025},
      eprint={2510.01581},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.01581},
}
```
