davidmezzetti committed
Commit 41b0ae4 · 1 Parent(s): 200f84a

Upload model
1_SpladePooling/config.json ADDED
@@ -0,0 +1,5 @@
+ {
+   "pooling_strategy": "max",
+   "activation_function": "relu",
+   "word_embedding_dimension": 30522
+ }
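The pooling settings above select SPLADE's aggregation: per-token MLM logits are passed through ReLU with log saturation, then max-pooled over the sequence into one vocabulary-sized vector. A minimal PyTorch sketch of that operation (an illustration, not the sentence-transformers implementation):

```python
import torch

def splade_pool(logits: torch.Tensor) -> torch.Tensor:
    """Collapse per-token MLM logits (seq_len x vocab_size) into one sparse vector."""
    # log(1 + relu(x)) keeps weights non-negative and dampens large logits
    weights = torch.log1p(torch.relu(logits))
    # pooling_strategy "max": take the maximum over the sequence dimension
    return weights.max(dim=0).values

# Example: 12 tokens over the 30522-token vocabulary from the config
vector = splade_pool(torch.randn(12, 30522))
print(f"{(vector > 0).sum().item()} active dims out of {vector.numel()}")
```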
README.md ADDED
@@ -0,0 +1,127 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sparse-encoder
+ - sparse
+ - splade
+ - generated_from_trainer
+ - loss:SpladeLoss
+ - loss:SparseMultipleNegativesRankingLoss
+ - loss:FlopsLoss
+ base_model: microsoft/BiomedNLP-BiomedBERT-base-uncased-abstract-fulltext
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - pearson_cosine
+ - spearman_cosine
+ - active_dims
+ - sparsity_ratio
+ model-index:
+ - name: SPLADE Sparse Encoder
+   results:
+   - task:
+       type: semantic-similarity
+       name: Semantic Similarity
+     metrics:
+     - type: pearson_cosine
+       value: 0.9422980731390805
+       name: Pearson Cosine
+     - type: spearman_cosine
+       value: 0.8870061609483617
+       name: Spearman Cosine
+     - type: active_dims
+       value: 34.0018196105957
+       name: Active Dims
+     - type: sparsity_ratio
+       value: 0.9988859897906233
+       name: Sparsity Ratio
+ language: en
+ license: apache-2.0
+ ---
+
+ # PubMedBERT SPLADE
+
+ This is a [SPLADE Sparse Encoder](https://www.sbert.net/docs/sparse_encoder/usage/usage.html) model fine-tuned from [PubMedBERT-base](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext) using [sentence-transformers](https://www.SBERT.net). It maps sentences and paragraphs to a 30522-dimensional sparse vector space and can be used for semantic search and sparse retrieval.
+
+ The training dataset was generated from a random sample of [PubMed](https://pubmed.ncbi.nlm.nih.gov/) title-abstract pairs, along with pairs of similar titles.
+
+ PubMedBERT SPLADE produces higher-quality sparse embeddings for medical literature than general-purpose models. Further fine-tuning on a medical subdomain should yield even better performance.
+
+ ## Usage (txtai)
+
+ This model can be used to build embeddings databases with [txtai](https://github.com/neuml/txtai) for semantic search and/or as a knowledge source for retrieval augmented generation (RAG).
+
+ _Note: txtai 8.7.0+ is required for sparse vector scoring support_
+
+ ```python
+ import txtai
+
+ # Build a sparse vector index backed by this model
+ embeddings = txtai.Embeddings(sparse="neuml/pubmedbert-base-splade", content=True)
+
+ # documents() is any iterable of data to index (see the sketch below)
+ embeddings.index(documents())
+
+ # Run a query
+ embeddings.search("query to run")
+ ```
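The `documents()` call above is left undefined in the card. A minimal sketch of what it could look like, assuming the `(id, text, tags)` tuple input format that txtai's `index` accepts:

```python
def documents():
    """Hypothetical generator yielding (id, text, tags) tuples for txtai."""
    data = [
        "Aspirin reduces the risk of cardiovascular events",
        "Metformin is a first-line treatment for type 2 diabetes",
    ]
    for uid, text in enumerate(data):
        yield uid, text, None
```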
+
+ ## Usage (Sentence-Transformers)
+
+ Alternatively, the model can be loaded with [sentence-transformers](https://www.SBERT.net).
+
+ ```python
+ from sentence_transformers import SparseEncoder
+
+ sentences = ["This is an example sentence", "Each sentence is converted"]
+
+ # SparseEncoder is the sparse model class in sentence-transformers 5.x
+ model = SparseEncoder("neuml/pubmedbert-base-splade")
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
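Because the outputs are sparse vectors over the vocabulary, they can be scored and inspected directly. A short sketch, assuming the `similarity` and `decode` helpers available on `SparseEncoder` in sentence-transformers 5.x:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("neuml/pubmedbert-base-splade")
embeddings = model.encode([
    "This is an example sentence",
    "Each sentence is converted",
])

# Dot-product similarity, matching similarity_fn_name "dot" in the model config
print(model.similarity(embeddings, embeddings))

# Show which vocabulary tokens carry the most weight in the first vector
print(model.decode(embeddings[0], top_k=10))
```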
+
+ ## Evaluation Results
+
+ Performance of this model compared to the top base models on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard) is shown below. A popular smaller model was also evaluated, along with the most downloaded PubMed similarity model on the Hugging Face Hub.
+
+ The following datasets were used to evaluate model performance.
+
+ - [PubMed QA](https://huggingface.co/datasets/qiaojin/PubMedQA)
+   - Subset: pqa_labeled, Split: train, Pair: (question, long_answer)
+ - [PubMed Subset](https://huggingface.co/datasets/awinml/pubmed_abstract_3_1k)
+   - Split: test, Pair: (title, text)
+ - [PubMed Summary](https://huggingface.co/datasets/armanc/scientific_papers)
+   - Subset: pubmed, Split: validation, Pair: (article, abstract)
+
+ The first table below uses the [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient) as the evaluation metric.
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | ----- | --------- | ------------- | -------------- | ------- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 90.40 | 95.92 | 94.07 | 93.46 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 91.02 | 95.82 | 94.49 | 93.78 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 92.97 | 96.90 | 96.24 | 95.37 |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 93.27 | 97.00 | 96.58 | 95.62 |
+ | [**pubmedbert-base-splade**](https://hf.co/neuml/pubmedbert-base-splade) | **90.76** | **96.20** | **95.87** | **94.28** |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 90.86 | 93.68 | 93.54 | 92.69 |
+
+ While this model wasn't the highest-scoring model on the Pearson metric, it does well when measured by the [Spearman rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient).
+
+ | Model | PubMed QA | PubMed Subset | PubMed Summary | Average |
+ | ----- | --------- | ------------- | -------------- | ------- |
+ | [all-MiniLM-L6-v2](https://hf.co/sentence-transformers/all-MiniLM-L6-v2) | 85.77 | 86.52 | 86.32 | 86.20 |
+ | [bge-base-en-v1.5](https://hf.co/BAAI/bge-base-en-v1.5) | 85.71 | 86.58 | 86.35 | 86.21 |
+ | [gte-base](https://hf.co/thenlper/gte-base) | 86.44 | 86.60 | 86.55 | 86.53 |
+ | [pubmedbert-base-embeddings](https://hf.co/neuml/pubmedbert-base-embeddings) | 86.29 | 86.57 | 86.47 | 86.44 |
+ | [**pubmedbert-base-splade**](https://hf.co/neuml/pubmedbert-base-splade) | **86.80** | **89.12** | **88.60** | **88.17** |
+ | [S-PubMedBert-MS-MARCO](https://hf.co/pritamdeka/S-PubMedBert-MS-MARCO) | 85.71 | 86.37 | 86.13 | 86.07 |
+
+ This indicates that the SPLADE model may do a better job of ranking results in the correct order.
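To make the Pearson/Spearman distinction concrete: Pearson rewards linear agreement of the raw scores, while Spearman only compares rankings. A toy illustration with SciPy (illustrative numbers, not the evaluation harness):

```python
from scipy.stats import pearsonr, spearmanr

# Toy scores: the ordering of the labels is preserved perfectly,
# but the relationship between labels and scores is nonlinear
labels = [0.1, 0.2, 0.3, 0.4, 0.5]
scores = [0.01, 0.04, 0.09, 0.30, 0.95]

print(pearsonr(labels, scores)[0])   # below 1.0, penalizes the nonlinearity
print(spearmanr(labels, scores)[0])  # exactly 1.0, the ranking is perfect
```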
+
+ ### Full Model Architecture
+
+ ```
+ SparseEncoder(
+   (0): MLMTransformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'BertForMaskedLM'})
+   (1): SpladePooling({'pooling_strategy': 'max', 'activation_function': 'relu', 'word_embedding_dimension': 30522})
+ )
+ ```
+
+ ## More Information
+
+ The training data for this model is the same as described in [this article](https://medium.com/neuml/embeddings-for-medical-literature-74dae6abf5e0). For more on the training scripts, see [this blog post](https://huggingface.co/blog/train-sparse-encoder).
config.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "architectures": [
+     "BertForMaskedLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.4",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,14 @@
+ {
+   "model_type": "SparseEncoder",
+   "__version__": {
+     "sentence_transformers": "5.0.0",
+     "transformers": "4.52.4",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {
+     "query": "",
+     "document": ""
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "dot"
+ }
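`similarity_fn_name: "dot"` means relevance is the dot product of query and document vectors, so only dimensions active in both contribute to the score. A tiny sketch of that idea with hypothetical token weights:

```python
# Sparse vectors as {vocab_index: weight} maps (hypothetical weights)
query = {1012: 1.3, 2054: 0.8, 29613: 2.1}
document = {1012: 0.9, 7592: 1.1, 29613: 1.7}

# Dot product over the shared active dimensions only
score = sum(weight * document[i] for i, weight in query.items() if i in document)
print(score)  # 1.3 * 0.9 + 2.1 * 1.7 = 4.74
```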
eval/similarity_evaluation_results.csv ADDED
@@ -0,0 +1,7 @@
+ epoch,steps,cosine_pearson,cosine_spearman,active_dims,sparsity_ratio
+ 0.15855147373594838,10000,0.9305500683984587,0.8662759073144441,74.58932876586914,0.9975562109702553
+ 0.31710294747189677,20000,0.9343322626967523,0.8742193615544332,54.046302795410156,0.9982292673220821
+ 0.4756544212078451,30000,0.9333958913897029,0.8842119466991207,37.624534606933594,0.9987672978636087
+ 0.6342058949437935,40000,0.9361898560379817,0.8862825807212503,35.35245227813721,0.9988417386711835
+ 0.7927573686797419,50000,0.9422040027533072,0.8842805139950729,35.53547668457031,0.9988357421962988
+ 0.9513088424156902,60000,0.9422980731390805,0.8870061609483617,34.0018196105957,0.9988859897906233
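The last two columns are two views of the same quantity: `sparsity_ratio` is the fraction of the 30522 vocabulary dimensions that are inactive, i.e. `1 - active_dims / 30522`. A quick check against the final row:

```python
active_dims = 34.0018196105957
print(1 - active_dims / 30522)  # ~0.9988859897906233, the logged sparsity_ratio
```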
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2ef69c3eca29c7451141d84e9cfe00f908b9ac56e9ff5189ab803ad74b37eb65
+ size 438080896
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.sparse_encoder.models.MLMTransformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_SpladePooling",
+     "type": "sentence_transformers.sparse_encoder.models.SpladePooling"
+   }
+ ]
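modules.json wires together the two modules shown in the architecture section: an MLM transformer at the repository root, then the pooling config in 1_SpladePooling/. A hedged sketch of assembling them by hand instead of calling `SparseEncoder("neuml/pubmedbert-base-splade")`; the module constructor arguments are assumptions:

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.models import MLMTransformer, SpladePooling

# Mirror what modules.json declares
mlm = MLMTransformer("neuml/pubmedbert-base-splade", max_seq_length=512)
pooling = SpladePooling(pooling_strategy="max", activation_function="relu")

model = SparseEncoder(modules=[mlm, pooling])
print(model.encode(["sparse retrieval for medical literature"]))
```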
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
similarity_evaluation_results.csv ADDED
@@ -0,0 +1,2 @@
+ epoch,steps,cosine_pearson,cosine_spearman,active_dims,sparsity_ratio
+ -1,-1,0.9492338008338339,0.8891274940266801,32.76852989196777,0.998926396373371
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "4": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
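The tokenizer is a lowercasing BertTokenizer; the huge `model_max_length` is a placeholder, with the effective limit of 512 coming from `max_seq_length` in sentence_bert_config.json. A quick way to inspect it, assuming the transformers library:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("neuml/pubmedbert-base-splade")

# do_lower_case is true, so input is lowercased before WordPiece splitting
print(tokenizer.tokenize("Metformin HCl 500mg"))
print(tokenizer("Metformin HCl 500mg")["input_ids"])
```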
vocab.txt ADDED
The diff for this file is too large to render. See raw diff