update
README.md
CHANGED
|
@@ -2617,6 +2617,7 @@ language:
|
|
| 2617 |
<a href="#evaluation">Evaluation</a> |
|
| 2618 |
<a href="#train">Train</a> |
|
| 2619 |
<a href="#contact">Contact</a> |
|
|
|
|
| 2620 |
<a href="#license">License</a>
|
| 2621 |
<p>
|
| 2622 |
</h4>
|
|
@@ -2630,6 +2631,7 @@ FlagEmbedding can map any text to a low-dimensional dense vector which can be us
|
|
| 2630 |
It can also be used in vector databases for LLMs.
|
| 2631 |
|
| 2632 |
************* 🌟**Updates**🌟 *************
|
|
|
|
| 2633 |
- 09/12/2023: New Release:
|
| 2634 |
- **New reranker model**: released the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
|
| 2635 |
- **Updated embedding model**: released the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and to improve retrieval ability without an instruction.
|
|
@@ -2664,10 +2666,9 @@ And it also can be used in vector databases for LLMs.
|
|
| 2664 |
|
| 2665 |
\*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
|
| 2666 |
|
| 2667 |
-
\**: Different embedding model, reranker
|
| 2668 |
For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
|
| 2669 |
|
| 2670 |
-
|
| 2671 |
## Frequently asked questions
|
| 2672 |
|
| 2673 |
<details>
|
|
@@ -2730,7 +2731,9 @@ If it doesn't work for you, you can see [FlagEmbedding](https://github.com/FlagO
|
|
| 2730 |
from FlagEmbedding import FlagModel
|
| 2731 |
sentences_1 = ["样例数据-1", "样例数据-2"]
|
| 2732 |
sentences_2 = ["样例数据-3", "样例数据-4"]
|
| 2733 |
-
model = FlagModel('BAAI/bge-large-zh',
|
|
|
|
|
|
|
| 2734 |
embeddings_1 = model.encode(sentences_1)
|
| 2735 |
embeddings_2 = model.encode(sentences_2)
|
| 2736 |
similarity = embeddings_1 @ embeddings_2.T
|
|
@@ -2761,7 +2764,7 @@ pip install -U sentence-transformers
|
|
| 2761 |
from sentence_transformers import SentenceTransformer
|
| 2762 |
sentences_1 = ["样例数据-1", "样例数据-2"]
|
| 2763 |
sentences_2 = ["样例数据-3", "样例数据-4"]
|
| 2764 |
-
model = SentenceTransformer('BAAI/bge-large-zh')
|
| 2765 |
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
|
| 2766 |
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
|
| 2767 |
similarity = embeddings_1 @ embeddings_2.T
|
|
@@ -2776,7 +2779,7 @@ queries = ['query_1', 'query_2']
|
|
| 2776 |
passages = ["样例文档-1", "样例文档-2"]
|
| 2777 |
instruction = "为这个句子生成表示以用于检索相关文章:"
|
| 2778 |
|
| 2779 |
-
model = SentenceTransformer('BAAI/bge-large-zh')
|
| 2780 |
q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
|
| 2781 |
p_embeddings = model.encode(passages, normalize_embeddings=True)
|
| 2782 |
scores = q_embeddings @ p_embeddings.T
|
|
@@ -2787,7 +2790,7 @@ scores = q_embeddings @ p_embeddings.T
|
|
| 2787 |
You can use `bge` in langchain like this:
|
| 2788 |
```python
|
| 2789 |
from langchain.embeddings import HuggingFaceBgeEmbeddings
|
| 2790 |
-
model_name = "BAAI/bge-
|
| 2791 |
model_kwargs = {'device': 'cuda'}
|
| 2792 |
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
|
| 2793 |
model = HuggingFaceBgeEmbeddings(
|
|
@@ -2811,8 +2814,8 @@ import torch
|
|
| 2811 |
sentences = ["样例数据-1", "样例数据-2"]
|
| 2812 |
|
| 2813 |
# Load model from HuggingFace Hub
|
| 2814 |
-
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh')
|
| 2815 |
-
model = AutoModel.from_pretrained('BAAI/bge-large-zh')
|
| 2816 |
model.eval()
|
| 2817 |
|
| 2818 |
# Tokenize sentences
|
|
@@ -2832,6 +2835,7 @@ print("Sentence embeddings:", sentence_embeddings)
|
|
| 2832 |
|
| 2833 |
### Usage for Reranker
|
| 2834 |
|
|
|
|
| 2835 |
You can get a relevance score by passing a query and a passage to the reranker.
|
| 2836 |
The reranker is optimized with a cross-entropy loss, so the relevance score is not bounded to a specific range.
|
| 2837 |
|
|
@@ -2841,10 +2845,10 @@ The reranker is optimized based cross-entropy loss, so the relevance score is no
|
|
| 2841 |
pip install -U FlagEmbedding
|
| 2842 |
```
|
| 2843 |
|
| 2844 |
-
Get relevance
|
| 2845 |
```python
|
| 2846 |
from FlagEmbedding import FlagReranker
|
| 2847 |
-
reranker = FlagReranker('BAAI/bge-reranker-
|
| 2848 |
|
| 2849 |
score = reranker.compute_score(['query', 'passage'])
|
| 2850 |
print(score)
|
|
@@ -2858,10 +2862,10 @@ print(scores)
|
|
| 2858 |
|
| 2859 |
```python
|
| 2860 |
import torch
|
| 2861 |
-
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
| 2862 |
|
| 2863 |
-
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-
|
| 2864 |
-
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-
|
| 2865 |
model.eval()
|
| 2866 |
|
| 2867 |
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
|
|
@@ -2927,7 +2931,7 @@ Please refer to [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C
|
|
| 2927 |
- **Reranking**:
|
| 2928 |
See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for evaluation script.
|
| 2929 |
|
| 2930 |
-
| Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* |
|
| 2931 |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|
| 2932 |
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
|
| 2933 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
|
|
@@ -2940,13 +2944,13 @@ See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for
|
|
| 2940 |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
|
| 2941 |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
|
| 2942 |
|
| 2943 |
-
\* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval
|
| 2944 |
|
| 2945 |
## Train
|
| 2946 |
|
| 2947 |
### BAAI Embedding
|
| 2948 |
|
| 2949 |
-
We pre-train the models using RetroMAE and then train them on large-scale paired data using contrastive learning.
|
| 2950 |
**You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
|
| 2951 |
We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
|
| 2952 |
Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.
|
|
@@ -2969,6 +2973,20 @@ If you have any question or suggestion related to this project, feel free to ope
|
|
| 2969 |
You can also email Shitao Xiao ([email protected]) and Zheng Liu ([email protected]).
|
| 2970 |
|
| 2971 |
|
| 2972 |
## License
|
| 2973 |
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
|
| 2974 |
|
|
|
|
| 2617 |
<a href="#evaluation">Evaluation</a> |
|
| 2618 |
<a href="#train">Train</a> |
|
| 2619 |
<a href="#contact">Contact</a> |
|
| 2620 |
+
<a href="#citation">Citation</a> |
|
| 2621 |
<a href="#license">License</a>
|
| 2622 |
<p>
|
| 2623 |
</h4>
|
|
|
|
| 2631 |
It can also be used in vector databases for LLMs.
|
| 2632 |
|
| 2633 |
************* 🌟**Updates**🌟 *************
|
| 2634 |
+
- 09/15/2023: Release [paper](https://arxiv.org/pdf/2309.07597.pdf) and [dataset](https://data.baai.ac.cn/details/BAAI-MTP).
|
| 2635 |
- 09/12/2023: New Release:
|
| 2636 |
- **New reranker model**: released the cross-encoder models `BAAI/bge-reranker-base` and `BAAI/bge-reranker-large`, which are more powerful than the embedding models. We recommend using or fine-tuning them to re-rank the top-k documents returned by embedding models.
|
| 2637 |
- **Updated embedding model**: released the `bge-*-v1.5` embedding models to alleviate the similarity distribution issue and to improve retrieval ability without an instruction.
|
|
|
|
| 2666 |
|
| 2667 |
\*: If you need to search for passages relevant to a query, we suggest adding the instruction to the query; in other cases, no instruction is needed and you can use the original query directly. In all cases, **no instruction** needs to be added to passages.
|
| 2668 |
|
| 2669 |
+
\**: Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding. To balance accuracy against time cost, a cross-encoder is widely used to re-rank the top-k documents retrieved by simpler models.
|
| 2670 |
For example, use the bge embedding model to retrieve the top 100 relevant documents, and then use the bge reranker to re-rank those 100 documents to get the final top-3 results.
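
The retrieve-then-rerank flow described here can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the function `retrieve_then_rerank`, its parameters, the random toy vectors, and the `toy_scorer` placeholder are all assumptions made for the example. In a real setup the embeddings would come from a bge embedding model and the second-stage scores from a bge reranker applied to (query, passage) text pairs.

```python
import numpy as np

def retrieve_then_rerank(query_vec, doc_vecs, pair_scorer, k_retrieve=100, k_final=3):
    """Two-stage search: cheap embedding retrieval, then a costlier re-rank.

    query_vec:   (d,) normalized query embedding
    doc_vecs:    (n, d) normalized document embeddings
    pair_scorer: hypothetical callable, doc index -> relevance score
                 (stands in for a cross-encoder reranker)
    """
    # Stage 1: for normalized vectors the inner product is the cosine similarity.
    sims = doc_vecs @ query_vec
    k_retrieve = min(k_retrieve, len(sims))
    candidates = np.argsort(-sims)[:k_retrieve]

    # Stage 2: run the expensive scorer only on the retrieved candidates.
    rerank_scores = np.array([pair_scorer(int(i)) for i in candidates])
    order = np.argsort(-rerank_scores)[:k_final]
    return [int(candidates[j]) for j in order]


# Toy data: 500 random unit vectors; the query is document 42 plus a little noise.
rng = np.random.default_rng(0)
docs = rng.normal(size=(500, 16))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[42] + 0.05 * rng.normal(size=16)
query /= np.linalg.norm(query)

# Placeholder scorer; a real system would call a reranker on (query, passage) text.
toy_scorer = lambda i: float(docs[i] @ query)

top3 = retrieve_then_rerank(query, docs, toy_scorer)
print(top3)
```

The point of the split is cost: the first stage is a single matrix product over all documents, while the second stage calls the slower pairwise model only `k_retrieve` times.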
|
| 2671 |
|
|
|
|
| 2672 |
## Frequently asked questions
|
| 2673 |
|
| 2674 |
<details>
|
|
|
|
| 2731 |
from FlagEmbedding import FlagModel
|
| 2732 |
sentences_1 = ["样例数据-1", "样例数据-2"]
|
| 2733 |
sentences_2 = ["样例数据-3", "样例数据-4"]
|
| 2734 |
+
model = FlagModel('BAAI/bge-large-zh-v1.5',
|
| 2735 |
+
query_instruction_for_retrieval="为这个句子生成表示以用于检索相关文章:",
|
| 2736 |
+
use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
|
| 2737 |
embeddings_1 = model.encode(sentences_1)
|
| 2738 |
embeddings_2 = model.encode(sentences_2)
|
| 2739 |
similarity = embeddings_1 @ embeddings_2.T
|
|
|
|
| 2764 |
from sentence_transformers import SentenceTransformer
|
| 2765 |
sentences_1 = ["样例数据-1", "样例数据-2"]
|
| 2766 |
sentences_2 = ["样例数据-3", "样例数据-4"]
|
| 2767 |
+
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
|
| 2768 |
embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)
|
| 2769 |
embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)
|
| 2770 |
similarity = embeddings_1 @ embeddings_2.T
|
|
|
|
| 2779 |
passages = ["样例文档-1", "样例文档-2"]
|
| 2780 |
instruction = "为这个句子生成表示以用于检索相关文章:"
|
| 2781 |
|
| 2782 |
+
model = SentenceTransformer('BAAI/bge-large-zh-v1.5')
|
| 2783 |
q_embeddings = model.encode([instruction+q for q in queries], normalize_embeddings=True)
|
| 2784 |
p_embeddings = model.encode(passages, normalize_embeddings=True)
|
| 2785 |
scores = q_embeddings @ p_embeddings.T
|
|
|
|
| 2790 |
You can use `bge` in langchain like this:
|
| 2791 |
```python
|
| 2792 |
from langchain.embeddings import HuggingFaceBgeEmbeddings
|
| 2793 |
+
model_name = "BAAI/bge-large-en-v1.5"
|
| 2794 |
model_kwargs = {'device': 'cuda'}
|
| 2795 |
encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity
|
| 2796 |
model = HuggingFaceBgeEmbeddings(
|
|
|
|
| 2814 |
sentences = ["样例数据-1", "样例数据-2"]
|
| 2815 |
|
| 2816 |
# Load model from HuggingFace Hub
|
| 2817 |
+
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-large-zh-v1.5')
|
| 2818 |
+
model = AutoModel.from_pretrained('BAAI/bge-large-zh-v1.5')
|
| 2819 |
model.eval()
|
| 2820 |
|
| 2821 |
# Tokenize sentences
|
|
|
|
| 2835 |
|
| 2836 |
### Usage for Reranker
|
| 2837 |
|
| 2838 |
+
Unlike an embedding model, a reranker takes a question and a document as input and directly outputs a similarity score instead of an embedding.
|
| 2839 |
You can get a relevance score by passing a query and a passage to the reranker.
|
| 2840 |
The reranker is optimized with a cross-entropy loss, so the relevance score is not bounded to a specific range.
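
If an application needs a bounded value, one common option is to pass the raw score through a sigmoid. This is an assumption of the sketch below, not something the README prescribes; the helper `sigmoid` and the sample scores are made up for illustration. Because the sigmoid is monotonic, it changes the scale of the scores but not their ranking.

```python
import math

def sigmoid(score: float) -> float:
    """Map an unbounded relevance score to the open interval (0, 1)."""
    # Numerically stable form: never exponentiates a large positive number.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

raw_scores = [-5.6, 0.0, 5.3]  # made-up raw scores, for illustration only
normalized = [sigmoid(s) for s in raw_scores]
print(normalized)
```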
|
| 2841 |
|
|
|
|
| 2845 |
pip install -U FlagEmbedding
|
| 2846 |
```
|
| 2847 |
|
| 2848 |
+
Get relevance scores (higher scores indicate more relevance):
|
| 2849 |
```python
|
| 2850 |
from FlagEmbedding import FlagReranker
|
| 2851 |
+
reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True) # Setting use_fp16 to True speeds up computation with a slight performance degradation
|
| 2852 |
|
| 2853 |
score = reranker.compute_score(['query', 'passage'])
|
| 2854 |
print(score)
|
|
|
|
| 2862 |
|
| 2863 |
```python
|
| 2864 |
import torch
|
| 2865 |
+
from transformers import AutoModelForSequenceClassification, AutoTokenizer
|
| 2866 |
|
| 2867 |
+
tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-reranker-large')
|
| 2868 |
+
model = AutoModelForSequenceClassification.from_pretrained('BAAI/bge-reranker-large')
|
| 2869 |
model.eval()
|
| 2870 |
|
| 2871 |
pairs = [['what is panda?', 'hi'], ['what is panda?', 'The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.']]
|
|
|
|
| 2931 |
- **Reranking**:
|
| 2932 |
See [C_MTEB](https://github.com/FlagOpen/FlagEmbedding/blob/master/C_MTEB/) for evaluation script.
|
| 2933 |
|
| 2934 |
+
| Model | T2Reranking | T2RerankingZh2En\* | T2RerankingEn2Zh\* | MMarcoReranking | CMedQAv1 | CMedQAv2 | Avg |
|
| 2935 |
|:-------------------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
|
| 2936 |
| text2vec-base-multilingual | 64.66 | 62.94 | 62.51 | 14.37 | 48.46 | 48.6 | 50.26 |
|
| 2937 |
| multilingual-e5-small | 65.62 | 60.94 | 56.41 | 29.91 | 67.26 | 66.54 | 57.78 |
|
|
|
|
| 2944 |
| [BAAI/bge-reranker-base](https://huggingface.co/BAAI/bge-reranker-base) | 67.28 | 63.95 | 60.45 | 35.46 | 81.26 | 84.1 | 65.42 |
|
| 2945 |
| [BAAI/bge-reranker-large](https://huggingface.co/BAAI/bge-reranker-large) | 67.6 | 64.03 | 61.44 | 37.16 | 82.15 | 84.18 | 66.09 |
|
| 2946 |
|
| 2947 |
+
\* : T2RerankingZh2En and T2RerankingEn2Zh are cross-language retrieval tasks
|
| 2948 |
|
| 2949 |
## Train
|
| 2950 |
|
| 2951 |
### BAAI Embedding
|
| 2952 |
|
| 2953 |
+
We pre-train the models using [RetroMAE](https://github.com/staoxiao/RetroMAE) and then train them on large-scale paired data using contrastive learning.
|
| 2954 |
**You can fine-tune the embedding model on your data following our [examples](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/finetune).**
|
| 2955 |
We also provide a [pre-train example](https://github.com/FlagOpen/FlagEmbedding/tree/master/examples/pretrain).
|
| 2956 |
Note that the goal of pre-training is to reconstruct the text, so the pre-trained model cannot be used for similarity calculation directly; it needs to be fine-tuned first.
|
|
|
|
| 2973 |
You can also email Shitao Xiao ([email protected]) and Zheng Liu ([email protected]).
|
| 2974 |
|
| 2975 |
|
| 2976 |
+
## Citation
|
| 2977 |
+
|
| 2978 |
+
If you find our work helpful, please cite us:
|
| 2979 |
+
```
|
| 2980 |
+
@misc{bge_embedding,
|
| 2981 |
+
title={C-Pack: Packaged Resources To Advance General Chinese Embedding},
|
| 2982 |
+
author={Shitao Xiao and Zheng Liu and Peitian Zhang and Niklas Muennighoff},
|
| 2983 |
+
year={2023},
|
| 2984 |
+
eprint={2309.07597},
|
| 2985 |
+
archivePrefix={arXiv},
|
| 2986 |
+
primaryClass={cs.CL}
|
| 2987 |
+
}
|
| 2988 |
+
```
|
| 2989 |
+
|
| 2990 |
## License
|
| 2991 |
FlagEmbedding is licensed under the [MIT License](https://github.com/FlagOpen/FlagEmbedding/blob/master/LICENSE). The released models can be used for commercial purposes free of charge.
|
| 2992 |
|