davidmezzetti committed
Commit 41b0ae4 · 1 Parent(s): 200f84a

Upload model
1_SpladePooling/config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "pooling_strategy": "max",
+   "activation_function": "relu",
+   "word_embedding_dimension": 30522
+ }
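The pooling settings above select SPLADE's aggregation: per-token MLM logits are passed through ReLU with log saturation, then max-pooled over the sequence into one vocabulary-sized vector. A minimal PyTorch sketch of that operation (an illustration, not the sentence-transformers implementation):

```python
import torch

def splade_pool(logits: torch.Tensor) -> torch.Tensor:
    """Collapse per-token MLM logits (seq_len x vocab_size) into one sparse vector."""
    # log(1 + relu(x)) keeps weights non-negative and dampens large logits
    weights = torch.log1p(torch.relu(logits))
    # pooling_strategy "max": take the maximum over the sequence dimension
    return weights.max(dim=0).values

# Example: 12 tokens over the 30522-token vocabulary from the config
vector = splade_pool(torch.randn(12, 30522))
print(f"{(vector > 0).sum().item()} active dims out of {vector.numel()}")
```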
README.md ADDED
@@ -0,0 +1,127 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sparse-encoder
+ - sparse
+ - splade
+ - generated_from_trainer
+ - loss:SpladeLoss
+ - loss:SparseMultipleNegativesRankingLoss
+ - loss:FlopsLoss
+ base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - pearson_cosine
+ - spearman_cosine
+ - active_dims
+ - sparsity_ratio
+ model-index:
+ - name: SPLADE Sparse Encoder
+   results:
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     metrics:
+     - type: pearson_cosine
+       value: 0.9422980731390805
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.8870061609483617
+       name: Spearman Cosine
+     - type: active_dims
+       value: 34.0018196105957
+       name: Active Dims
+     - type: sparsity_ratio
+       value: 0.9988859897906233
+       name: Sparsity Ratio
+ language: en
+ license: apache-2.0
+ ---
+
+ # PubMedBERT SPLADE
+
+ This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model fine-tuned from [PubMedBERT-base](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) using [sentence-transformers](https://www.SBERT.net). It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
+
+ The training dataset was generated from a random sample of [PubMed](https://pubmed.ncbi.nlm.nih.gov/) title-abstract pairs, along with pairs of similar titles.
+
+ PubMedBERT SPLADE produces higher-quality sparse embeddings for medical literature than general-purpose models. Further fine-tuning on a medical subdomain should yield even better performance.
+
+ ## Usage (txtai)
+
+ This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
+
+ _Note: txtai 8.7.0+ is required for sparse vector scoring support_
+
+ ```python
+ import txtai
+
+ # Build a sparse vector index backed by this model
+ embeddings = txtai.Embeddings(sparse="neuml/pubmedbert-base-splade", content=True)
+
+ # documents() is any iterable of data to index (see the sketch below)
+ embeddings.index(documents())
+
+ # Run a query
+ embeddings.search("query to run")
+ ```
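The `documents()` call above is left undefined in the card. A minimal sketch of what it could look like, assuming the `(id, text, tags)` tuple input format that txtai's `index` accepts:

```python
def documents():
    """Hypothetical generator yielding (id, text, tags) tuples for txtai."""
    data = [
        "Aspirin reduces the risk of cardiovascular events",
        "Metformin is a first-line treatment for type 2 diabetes",
    ]
    for uid, text in enumerate(data):
        yield uid, text, None
```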
+
+ ## Usage (Sentence-Transformers)
+
+ Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).
+
+ ```python
+ from sentence_transformers import SparseEncoder
+
+ sentences = ["This is an example sentence", "Each sentence is converted"]
+
+ # SparseEncoder is the sparse model class in sentence-transformers 5.x
+ model = SparseEncoder("neuml/pubmedbert-base-splade")
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
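Because the outputs are sparse vectors over the vocabulary, they can be scored and inspected directly. A short sketch, assuming the `similarity` and `decode` helpers available on `SparseEncoder` in sentence-transformers 5.x:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("neuml/pubmedbert-base-splade")
embeddings = model.encode([
    "This is an example sentence",
    "Each sentence is converted",
])

# Dot-product similarity, matching similarity_fn_name "dot" in the model config
print(model.similarity(embeddings, embeddings))

# Show which vocabulary tokens carry the most weight in the first vector
print(model.decode(embeddings[0], top_k=10))
```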
+
+ ## Evaluation Results
+
+ Performance of this model compared to the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is shown below. A popular smaller model was also evaluated, along with the most downloaded PubMed similarity model on the Hugging Face Hub.
+
+ The following datasets were used to evaluate model performance.
+
+ - [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+   - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+ - [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+   - Split: test, Pair: (title, text)
+ - [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
+   - Subset: pubmed, Split: validation, Pair: (article, abstract)
+
+ The first table below uses the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) as the evaluation metric.
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | ----- | --------- | ------------- | -------------- | ------- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 91.02 | 95.82 | 94.49 | 93.78 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 92.97 | 96.90 | 96.24 | 95.37 |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
+ | [**pubmedbert-base-splade**](https://hf.co/neuml/pubmedbert-base-splade) | **90.76** | **96.20** | **95.87** | **94.28** |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 90.86 | 93.68 | 93.54 | 92.69 |
+
+ While this model wasn't the highest-scoring model on the Pearson metric, it does well when measured by the [Spearman rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | ----- | --------- | ------------- | -------------- | ------- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 85.77 | 86.52 | 86.32 | 86.20 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 85.71 | 86.58 | 86.35 | 86.21 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 86.44 | 86.60 | 86.55 | 86.53 |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 86.29 | 86.57 | 86.47 | 86.44 |
+ | [**pubmedbert-base-splade**](https://hf.co/neuml/pubmedbert-base-splade) | **86.80** | **89.12** | **88.60** | **88.17** |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 85.71 | 86.37 | 86.13 | 86.07 |
+
+ This indicates that the SPLADE model may do a better job of ranking results in the correct order.
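To make the Pearson/Spearman distinction concrete: Pearson rewards linear agreement of the raw scores, while Spearman only compares rankings. A toy illustration with SciPy (illustrative numbers, not the evaluation harness):

```python
from scipy.stats import pearsonr, spearmanr

# Toy scores: the ordering of the labels is preserved perfectly,
# but the relationship between labels and scores is nonlinear
labels = [0.1, 0.2, 0.3, 0.4, 0.5]
scores = [0.01, 0.04, 0.09, 0.30, 0.95]

print(pearsonr(labels, scores)[0])   # below 1.0, penalizes the nonlinearity
print(spearmanr(labels, scores)[0])  # exactly 1.0, the ranking is perfect
```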
+
+ ### Full Model Architecture
+
+ ```
+ SparseEncoder(
+   (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
+   (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
+ )
+ ```
+
+ ## More Information
+
+ The training data for this model is the same as described in [this article](https://medium.com/neuml/embeddings-for-medical-literature-74dae6abf5e0). For more on the training scripts, see [this blog post](https://huggingface.co/blog/train-sparse-encoder).
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SparseEncoder",
+   "__version__": {
+     "sentence_transformers": "5.0.0",
+     "transformers": "4.52.4",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "dot"
+ }
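`similarity_fn_name: "dot"` means relevance is the dot product of query and document vectors, so only dimensions active in both contribute to the score. A tiny sketch of that idea with hypothetical token weights:

```python
# Sparse vectors as {vocab_index: weight} maps (hypothetical weights)
query = {1012: 1.3, 2054: 0.8, 29613: 2.1}
document = {1012: 0.9, 7592: 1.1, 29613: 1.7}

# Dot product over the shared active dimensions only
score = sum(weight * document[i] for i, weight in query.items() if i in document)
print(score)  # 1.3 * 0.9 + 2.1 * 1.7 = 4.74
```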
eval/similarity_evaluation_results.csv ADDED
@@ -0,0 +1,7 @@
+ epoch,steps,cosine_pearson,cosine_spearman,active_dims,sparsity_ratio
+ 0.15855147373594838,10000,0.9305500683984587,0.8662759073144441,74.58932876586914,0.9975562109702553
+ 0.31710294747189677,20000,0.9343322626967523,0.8742193615544332,54.046302795410156,0.9982292673220821
+ 0.4756544212078451,30000,0.9333958913897029,0.8842119466991207,37.624534606933594,0.9987672978636087
+ 0.6342058949437935,40000,0.9361898560379817,0.8862825807212503,35.35245227813721,0.9988417386711835
+ 0.7927573686797419,50000,0.9422040027533072,0.8842805139950729,35.53547668457031,0.9988357421962988
+ 0.9513088424156902,60000,0.9422980731390805,0.8870061609483617,34.0018196105957,0.9988859897906233
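The last two columns are two views of the same quantity: `sparsity_ratio` is the fraction of the 30522 vocabulary dimensions that are inactive, i.e. `1 - active_dims / 30522`. A quick check against the final row:

```python
active_dims = 34.0018196105957
print(1 - active_dims / 30522)  # ~0.9988859897906233, the logged sparsity_ratio
```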
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2ef69c3eca29c7451141d84e9cfe00f908b9ac56e9ff5189ab803ad74b37eb65
+ size 438080896
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.sparse_encoder.models.MLMTransformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_SpladePooling",
+     "type": "sentence_transformers.sparse_encoder.models.SpladePooling"
+   }
+ ]
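modules.json wires together the two modules shown in the architecture section: an MLM transformer at the repository root, then the pooling config in 1_SpladePooling/. A hedged sketch of assembling them by hand instead of calling `SparseEncoder("neuml/pubmedbert-base-splade")`; the module constructor arguments are assumptions:

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling

# Mirror what modules.json declares
mlm = MLMTransformer("neuml/pubmedbert-base-splade", max_seq_length=512)
pooling = SpladePooling(pooling_strategy="max", activation_function="relu")

model = SparseEncoder(modules=[mlm, pooling])
print(model.encode(["sparse retrieval for medical literature"]))
```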
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
similarity_evaluation_results.csv ADDED
@@ -0,0 +1,2 @@
+ epoch,steps,cosine_pearson,cosine_spearman,active_dims,sparsity_ratio
+ -1,-1,0.9492338008338339,0.8891274940266801,32.76852989196777,0.998926396373371
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
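The tokenizer is a lowercasing BertTokenizer; the huge `model_max_length` is a placeholder, with the effective limit of 512 coming from `max_seq_length` in sentence_bert_config.json. A quick way to inspect it, assuming the transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuml/pubmedbert-base-splade")

# do_lower_case is true, so input is lowercased before WordPiece splitting
print(tokenizer.tokenize("Metformin HCl 500mg"))
print(tokenizer("Metformin HCl 500mg")["input_ids"])
```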
vocab.txt ADDED
The diff for this file is too large to render. See raw diff