---
tags:
- feature-extraction
- sentence-similarity
- sentence-transformers
- transformers
license: cc-by-nc-4.0
---


<div align="center">
<h1> ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval </h1>
</div>

<p align="center">
  <a href="https://arxiv.org/abs/2510.08252" target="_blank" rel="noopener noreferrer">
    <img src="https://img.shields.io/badge/arXiv-2510.08252-B31B1B.svg?style=flat-square&logo=arxiv&logoColor=white" alt="arXiv:2510.08252">
  </a>
</p>

We propose **ReasonEmbed**, a new text embedding model for reasoning-intensive document retrieval, built on novel techniques for generating and using synthetic training data. For more details, please refer to our GitHub repository, [ReasonEmbed](https://github.com/VectorSpaceLab/agentic-search/tree/main/ReasonEmbed), and our [paper](https://arxiv.org/abs/2510.08252).


## Introduction

This repository hosts the model **reason-embed-llama-3.1-8b-0928**, fine-tuned from [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) with our novel RI-InfoNCE loss (see our [paper](https://arxiv.org/abs/2510.08252) for details) on our synthetic dataset. It achieves an nDCG@10 of 36.2 on the [BRIGHT](https://brightbenchmark.github.io/) benchmark with the original queries, demonstrating strong capability on reasoning-intensive retrieval tasks.
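
For background, RI-InfoNCE builds on the standard InfoNCE contrastive objective. As a rough sketch only (the reasoning-intensity weighting that distinguishes RI-InfoNCE is defined in the paper, not reproduced here), plain InfoNCE for a query $q$, positive document $d^+$, negatives $\mathcal{D}^-$, similarity $s(\cdot,\cdot)$, and temperature $\tau$ is:

```latex
% Standard InfoNCE objective. RI-InfoNCE adds a reasoning-intensity
% weighting on top of this; see the paper for the exact formulation.
\mathcal{L}_{\text{InfoNCE}}
  = -\log
    \frac{\exp\left( s(q, d^+) / \tau \right)}
         {\exp\left( s(q, d^+) / \tau \right)
          + \sum_{d^- \in \mathcal{D}^-} \exp\left( s(q, d^-) / \tau \right)}
```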

We provide the evaluation [script](./evaluation_scripts/eval_bright_short.sh) to reproduce the results.

![bright-performance](./imgs/bright-performance.png)

## Usage 

### Using FlagEmbedding
```bash
git clone https://github.com/FlagOpen/FlagEmbedding.git
cd FlagEmbedding
pip install -e .
```

```python
from FlagEmbedding import FlagLLMModel
queries = [
    # taken from BRIGHT TheoT dataset, qid: examples-TheoremQA_wenhuchen/eigen_value1.json
    "Imagine you have a magical box that transforms any object you put inside it, where the object is represented by the column vector x = (x_1, x_2). The box's transformation can be represented by the matrix A = [[5, 4], [1, 2]], so when given an object x, the box outputs the new object Ax. On some special objects, this new object is just a constant multiple of the original object, λx = (λx_1, λx_2). Find both possible values of λ where this occurs — note that these are the box's eigenvalues.",
    # taken from BRIGHT TheoT dataset, qid: examples-TheoremQA_maxku/ipnetwork13-hammingdist.json
    "Imagine you're comparing three digital images that are extremely simplified down to a grid of 5 pixels each, represented by either black (0) or white (1) pixels. The images are as follows: Image A: 00000, Image B: 10101, and Image C: 01010. By counting the number of pixels that differ between each pair of images, find the smallest number of differing pixels."
]
documents = [
    # taken from BRIGHT TheoT dataset, docid: 2723
    "\\begin{definition}[Definition:Eigenvector/Linear Operator]\nLet $K$ be a field.\nLet $V$ be a vector space over $K$. \nLet $A : V \\to V$ be a linear operator.\nLet $\\lambda \\in K$ be an eigenvalue of $A$.\nA non-zero vector $v \\in V$ is an '''eigenvector corresponding to $\\lambda$''' {{iff}}:\n:$v \\in \\map \\ker {A - \\lambda I}$\nwhere: \n:$I : V \\to V$ is the identity mapping on $V$\n:$\\map \\ker {A - \\lambda I}$ denotes the kernel of $A - \\lambda I$.\nThat is, {{iff}}: \n:$A v = \\lambda v$\n\\end{definition}",
    # taken from BRIGHT TheoT dataset, docid: 14101
    "\\section{Error Correction Capability of Linear Code}\nTags: Linear Codes\n\n\\begin{theorem}\nLet $C$ be a linear code.\nLet $C$ have a minimum distance $d$.\nThen $C$ corrects $e$ transmission errors for all $e$ such that $2 e + 1 \\le d$.\n\\end{theorem}\n\n\\begin{proof}\nLet $C$ be a linear code whose master code is $V$.\nLet $c \\in C$ be a transmitted codeword.\nLet $v$ be the received word from $c$.\nBy definition, $v$ is an element of $V$.\nLet $v$ have a distance $e$ from $c$, where $2 e + 1 \\le d$.\nThus there have been $e$ transmission errors.\n{{AimForCont}} $c_1$ is a codeword of $C$, distinct from $c$, such that $\\map d {v, c_1} \\le e$.\nThen:\n{{begin-eqn}}\n{{eqn | l = \\map d {c, c_1}\n      | o = \\le\n      | r = \\map d {c, v} + \\map d {v, c_1}\n      | c = \n}}\n{{eqn | o = \\le\n      | r = e + e\n      | c = \n}}\n{{eqn | o = <\n      | r = d\n      | c = \n}}\n{{end-eqn}}\nSo $c_1$ has a distance from $c$ less than $d$.\nBut $C$ has a minimum distance $d$.\nThus $c_1$ cannot be a codeword of $C$.\nFrom this contradiction it follows that there is no codeword of $C$ closer to $v$ than $c$.\nHence there is a unique codeword of $C$ which has the smallest distance from $v$.\nHence it can be understood that $C$ has corrected the transmission errors of $v$.\n{{Qed}}\n\\end{proof}\n\n"
]
model = FlagLLMModel("hanhainebula/reason-embed-llama-3.1-8b-0928", 
                     query_instruction_for_retrieval="Given a Math problem, retrieve relevant theorems that help answer the problem.",
                     query_instruction_format="Instruct: {}\nQuery: {}",
                     devices="cuda:0",  # set devices to "cuda:0" for testing on a single GPU
                     use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
embeddings_1 = model.encode_queries(queries)
embeddings_2 = model.encode_corpus(documents)
similarity = embeddings_1 @ embeddings_2.T
print(similarity)
```
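
The printed similarity matrix has one row per query and one column per document. Because the examples above are matched pairs from the BRIGHT TheoT dataset (the eigenvalue question with the eigenvector definition, and the pixel-difference question with the error-correction theorem), the diagonal entries should receive the highest scores.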


### Using Sentence Transformers

```python
from sentence_transformers import SentenceTransformer
import torch

# Load the model, optionally in float16 precision for faster inference
model = SentenceTransformer("hanhainebula/reason-embed-llama-3.1-8b-0928", model_kwargs={"torch_dtype": torch.float16})


queries = [
    # taken from BRIGHT TheoT dataset, qid: examples-TheoremQA_wenhuchen/eigen_value1.json
    "Imagine you have a magical box that transforms any object you put inside it, where the object is represented by the column vector x = (x_1, x_2). The box's transformation can be represented by the matrix A = [[5, 4], [1, 2]], so when given an object x, the box outputs the new object Ax. On some special objects, this new object is just a constant multiple of the original object, λx = (λx_1, λx_2). Find both possible values of λ where this occurs — note that these are the box's eigenvalues.",
    # taken from BRIGHT TheoT dataset, qid: examples-TheoremQA_maxku/ipnetwork13-hammingdist.json
    "Imagine you're comparing three digital images that are extremely simplified down to a grid of 5 pixels each, represented by either black (0) or white (1) pixels. The images are as follows: Image A: 00000, Image B: 10101, and Image C: 01010. By counting the number of pixels that differ between each pair of images, find the smallest number of differing pixels."
]
documents = [
    # taken from BRIGHT TheoT dataset, docid: 2723
    "\\begin{definition}[Definition:Eigenvector/Linear Operator]\nLet $K$ be a field.\nLet $V$ be a vector space over $K$. \nLet $A : V \\to V$ be a linear operator.\nLet $\\lambda \\in K$ be an eigenvalue of $A$.\nA non-zero vector $v \\in V$ is an '''eigenvector corresponding to $\\lambda$''' {{iff}}:\n:$v \\in \\map \\ker {A - \\lambda I}$\nwhere: \n:$I : V \\to V$ is the identity mapping on $V$\n:$\\map \\ker {A - \\lambda I}$ denotes the kernel of $A - \\lambda I$.\nThat is, {{iff}}: \n:$A v = \\lambda v$\n\\end{definition}",
    # taken from BRIGHT TheoT dataset, docid: 14101
    "\\section{Error Correction Capability of Linear Code}\nTags: Linear Codes\n\n\\begin{theorem}\nLet $C$ be a linear code.\nLet $C$ have a minimum distance $d$.\nThen $C$ corrects $e$ transmission errors for all $e$ such that $2 e + 1 \\le d$.\n\\end{theorem}\n\n\\begin{proof}\nLet $C$ be a linear code whose master code is $V$.\nLet $c \\in C$ be a transmitted codeword.\nLet $v$ be the received word from $c$.\nBy definition, $v$ is an element of $V$.\nLet $v$ have a distance $e$ from $c$, where $2 e + 1 \\le d$.\nThus there have been $e$ transmission errors.\n{{AimForCont}} $c_1$ is a codeword of $C$, distinct from $c$, such that $\\map d {v, c_1} \\le e$.\nThen:\n{{begin-eqn}}\n{{eqn | l = \\map d {c, c_1}\n      | o = \\le\n      | r = \\map d {c, v} + \\map d {v, c_1}\n      | c = \n}}\n{{eqn | o = \\le\n      | r = e + e\n      | c = \n}}\n{{eqn | o = <\n      | r = d\n      | c = \n}}\n{{end-eqn}}\nSo $c_1$ has a distance from $c$ less than $d$.\nBut $C$ has a minimum distance $d$.\nThus $c_1$ cannot be a codeword of $C$.\nFrom this contradiction it follows that there is no codeword of $C$ closer to $v$ than $c$.\nHence there is a unique codeword of $C$ which has the smallest distance from $v$.\nHence it can be understood that $C$ has corrected the transmission errors of $v$.\n{{Qed}}\n\\end{proof}\n\n"
]

query_embeddings = model.encode(queries, prompt="Instruct: Given a Math problem, retrieve relevant theorems that help answer the problem.\nQuery: ")
document_embeddings = model.encode(documents)

# Compute the (cosine) similarity between the query and document embeddings
similarity = model.similarity(query_embeddings, document_embeddings)
print(similarity)
```
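
Note that the instruction prompt is applied only to the queries; the documents are encoded as-is, matching the asymmetric query/corpus encoding in the FlagEmbedding example above.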


### Using HuggingFace Transformers
```python
import torch
import torch.nn.functional as F

from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def last_token_pool(last_hidden_states: Tensor,
                    attention_mask: Tensor) -> Tensor:
    # With left padding, every sequence ends at the final position,
    # so the embedding is simply the last hidden state.
    left_padding = (attention_mask[:, -1].sum() == attention_mask.shape[0])
    if left_padding:
        return last_hidden_states[:, -1]
    else:
        # With right padding, gather the hidden state at each sequence's
        # last non-padding token instead.
        sequence_lengths = attention_mask.sum(dim=1) - 1
        batch_size = last_hidden_states.shape[0]
        return last_hidden_states[torch.arange(batch_size, device=last_hidden_states.device), sequence_lengths]


def get_detailed_instruct(task_description: str, query: str) -> str:
    return f'Instruct: {task_description}\nQuery: {query}'


def tokenize_texts(tokenizer, texts, max_length: int, device: str):
    # Tokenize with truncation; padding to a multiple of 8 enables more
    # efficient tensor-core kernels on GPU.
    batch_dict = tokenizer(texts, max_length=max_length, padding=True, truncation=True, return_tensors='pt', pad_to_multiple_of=8)
    batch_dict = {k: v.to(device) for k, v in batch_dict.items()}
    return batch_dict


task = 'Given a Math problem, retrieve relevant theorems that help answer the problem.'
queries = [
    # taken from BRIGHT TheoT dataset, qid: examples-TheoremQA_wenhuchen/eigen_value1.json
    "Imagine you have a magical box that transforms any object you put inside it, where the object is represented by the column vector x = (x_1, x_2). The box's transformation can be represented by the matrix A = [[5, 4], [1, 2]], so when given an object x, the box outputs the new object Ax. On some special objects, this new object is just a constant multiple of the original object, λx = (λx_1, λx_2). Find both possible values of λ where this occurs — note that these are the box's eigenvalues.",
    # taken from BRIGHT TheoT dataset, qid: examples-TheoremQA_maxku/ipnetwork13-hammingdist.json
    "Imagine you're comparing three digital images that are extremely simplified down to a grid of 5 pixels each, represented by either black (0) or white (1) pixels. The images are as follows: Image A: 00000, Image B: 10101, and Image C: 01010. By counting the number of pixels that differ between each pair of images, find the smallest number of differing pixels."
]
queries = [get_detailed_instruct(task, q) for q in queries]
documents = [
    # taken from BRIGHT TheoT dataset, docid: 2723
    "\\begin{definition}[Definition:Eigenvector/Linear Operator]\nLet $K$ be a field.\nLet $V$ be a vector space over $K$. \nLet $A : V \\to V$ be a linear operator.\nLet $\\lambda \\in K$ be an eigenvalue of $A$.\nA non-zero vector $v \\in V$ is an '''eigenvector corresponding to $\\lambda$''' {{iff}}:\n:$v \\in \\map \\ker {A - \\lambda I}$\nwhere: \n:$I : V \\to V$ is the identity mapping on $V$\n:$\\map \\ker {A - \\lambda I}$ denotes the kernel of $A - \\lambda I$.\nThat is, {{iff}}: \n:$A v = \\lambda v$\n\\end{definition}",
    # taken from BRIGHT TheoT dataset, docid: 14101
    "\\section{Error Correction Capability of Linear Code}\nTags: Linear Codes\n\n\\begin{theorem}\nLet $C$ be a linear code.\nLet $C$ have a minimum distance $d$.\nThen $C$ corrects $e$ transmission errors for all $e$ such that $2 e + 1 \\le d$.\n\\end{theorem}\n\n\\begin{proof}\nLet $C$ be a linear code whose master code is $V$.\nLet $c \\in C$ be a transmitted codeword.\nLet $v$ be the received word from $c$.\nBy definition, $v$ is an element of $V$.\nLet $v$ have a distance $e$ from $c$, where $2 e + 1 \\le d$.\nThus there have been $e$ transmission errors.\n{{AimForCont}} $c_1$ is a codeword of $C$, distinct from $c$, such that $\\map d {v, c_1} \\le e$.\nThen:\n{{begin-eqn}}\n{{eqn | l = \\map d {c, c_1}\n      | o = \\le\n      | r = \\map d {c, v} + \\map d {v, c_1}\n      | c = \n}}\n{{eqn | o = \\le\n      | r = e + e\n      | c = \n}}\n{{eqn | o = <\n      | r = d\n      | c = \n}}\n{{end-eqn}}\nSo $c_1$ has a distance from $c$ less than $d$.\nBut $C$ has a minimum distance $d$.\nThus $c_1$ cannot be a codeword of $C$.\nFrom this contradiction it follows that there is no codeword of $C$ closer to $v$ than $c$.\nHence there is a unique codeword of $C$ which has the smallest distance from $v$.\nHence it can be understood that $C$ has corrected the transmission errors of $v$.\n{{Qed}}\n\\end{proof}\n\n"
]

tokenizer = AutoTokenizer.from_pretrained("hanhainebula/reason-embed-llama-3.1-8b-0928")
model = AutoModel.from_pretrained("hanhainebula/reason-embed-llama-3.1-8b-0928")
model.eval()

device = "cuda:0"   # set device to "cuda:0" for testing on a single GPU
model.to(device)
model.half()

max_length = 512
# Tokenize the input texts
query_batch_dict = tokenize_texts(tokenizer, queries, max_length, device)
doc_batch_dict = tokenize_texts(tokenizer, documents, max_length, device)

with torch.no_grad():
    query_outputs = model(**query_batch_dict)
    query_embeddings = last_token_pool(query_outputs.last_hidden_state, query_batch_dict['attention_mask'])
    
    doc_outputs = model(**doc_batch_dict)
    doc_embeddings = last_token_pool(doc_outputs.last_hidden_state, doc_batch_dict['attention_mask'])

# normalize embeddings
query_embeddings = F.normalize(query_embeddings, p=2, dim=1)
doc_embeddings = F.normalize(doc_embeddings, p=2, dim=1)
scores = (query_embeddings @ doc_embeddings.T) * 100
print(scores.cpu().tolist())
```
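
Here `max_length = 512` keeps the demo lightweight; longer corpus documents may need a larger value, at the cost of memory. To reproduce the reported BRIGHT results, use the settings in the evaluation script linked above.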


## Citation

If you find this repository useful, please consider giving it a star ⭐ and citing our work:
```bibtex
@article{chen2025reasonembed,
  title={ReasonEmbed: Enhanced Text Embeddings for Reasoning-Intensive Document Retrieval},
  author={Chen, Jianlyu and Lan, Junwei and Li, Chaofan and Lian, Defu and Liu, Zheng},
  journal={arXiv preprint arXiv:2510.08252},
  year={2025}
}
```