---
license: mit
base_model:
- thomas-sounack/BioClinical-ModernBERT-base
tags:
- sentence-transformers
- sentence-similarity
- medical
- clinical
- biomedical
- pubmed
- healthcare
- medical-ai
- clinical-nlp
- bioinformatics
- medical-literature
- clinical-text
---
# Clinical ModernBERT Embedding Model
A specialized medical embedding model fine-tuned from BioClinical ModernBERT (`thomas-sounack/BioClinical-ModernBERT-base`) using InfoNCE contrastive learning on PubMed title-abstract pairs.
## Model Details
- **Base Model**: thomas-sounack/BioClinical-ModernBERT-base
- **Training Method**: InfoNCE contrastive learning
- **Training Data**: PubMed title-abstract pairs
- **Max Sequence Length**: 2048 tokens
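The InfoNCE objective named above can be illustrated with a minimal, dependency-free sketch. It assumes in-batch negatives (each title is scored against every abstract in the batch, and the matching abstract is the positive) and a temperature of 0.05. Neither detail is stated in this card, and the similarity matrices are hypothetical toy values, so treat this as an illustration of the loss rather than the actual training code.

```python
import math

def info_nce_loss(similarities, temperature=0.05):
    """Mean InfoNCE loss over a batch, assuming in-batch negatives.

    similarities[i][j] is the similarity between title i and abstract j;
    the positive for row i is column i. The temperature is an assumed value.
    """
    total = 0.0
    for i, row in enumerate(similarities):
        logits = [s / temperature for s in row]
        # Numerically stable log-sum-exp over all candidates in the row.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        # Cross-entropy with the diagonal entry as the correct class.
        total += log_denominator - logits[i]
    return total / len(similarities)

# Hypothetical toy similarity matrices (not model output).
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
poorly_separated = [[0.4, 0.5, 0.4], [0.5, 0.4, 0.5], [0.4, 0.5, 0.4]]

print(info_nce_loss(well_separated))    # small loss: positives dominate each row
print(info_nce_loss(poorly_separated))  # larger loss: positives are not ranked first
```

The loss falls as each title becomes more similar to its own abstract than to the other abstracts in the batch, which is the behavior contrastive fine-tuning aims for.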
## Usage
```python
from sentence_transformers import SentenceTransformer
# Load the model
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")
# Encode medical texts
texts = [
    "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
    "Inflammatory cytokines in RA lead to progressive cartilage and bone destruction.",
]
embeddings = model.encode(texts)
```
## Applications
- Medical document similarity analysis
- Clinical text retrieval systems
- Biomedical literature search
- Medical concept matching and classification
## Model Comparison
Compared to `NeuML/bioclinical-modernbert-base-embeddings`, this model scores higher on every retrieval metric reported below and assigns lower similarity to non-medical content.
### Comprehensive Evaluation Results
| Metric | Our Model | NeuML Model | Relative Improvement |
|--------|-----------|-------------|-------------|
| **Accuracy@1** | 91.28% | 85.86% | +6.3% |
| **Accuracy@3** | 98.46% | 95.66% | +2.9% |
| **Accuracy@5** | 99.24% | 97.14% | +2.2% |
| **Accuracy@10** | 99.64% | 98.29% | +1.4% |
| **NDCG@5** | 95.96% | 92.37% | +3.9% |
| **NDCG@10** | 96.10% | 92.75% | +3.6% |
| **MRR@10** | 94.89% | 90.90% | +4.4% |
| **MAP@100** | 94.91% | 90.96% | +4.3% |
*Evaluation performed using `InformationRetrievalEvaluator` from sentence-transformers on the `gamino/wiki_medical_terms` dataset.*
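As a rough guide to reading the table, the sketch below shows how Accuracy@k and MRR@k are commonly defined for retrieval: Accuracy@k counts a query as a hit if any relevant document appears in the top k results, and MRR@k averages the reciprocal rank of the first relevant document. The function names and toy rankings are hypothetical, and this is not the evaluator's own implementation.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc in relevant for doc in ranked[:k])
    )
    return hits / len(ranked_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc in enumerate(ranked[:k], start=1):
            if doc in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids)

# Hypothetical toy rankings for three queries (not the real evaluation data).
ranked = [["d1", "d2", "d3"], ["d5", "d4", "d6"], ["d9", "d8", "d7"]]
relevant = [{"d1"}, {"d4"}, {"d0"}]

print(accuracy_at_k(ranked, relevant, k=1))
print(accuracy_at_k(ranked, relevant, k=3))
print(mrr_at_k(ranked, relevant, k=3))
```

In this toy case the first query is answered at rank 1, the second at rank 2, and the third not at all, so Accuracy@1 is 1/3, Accuracy@3 is 2/3, and MRR@3 is 0.5.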
### Medical Text Similarity
**Example 1: Related Medical Concepts**
```python
text1 = "Hypertension increases the risk of stroke and heart attack."
text2 = "High blood pressure damages arterial walls over time, leading to cardiovascular events."
# Cosine Similarity Results:
# Our Model: 0.5941 (59.4%)
# NeuML Model: 0.5267 (52.7%)
# Improvement: +12.7%
```
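For readers unfamiliar with the metric, cosine similarity is the dot product of two embedding vectors divided by the product of their lengths. The sketch below uses short hypothetical vectors so it runs without downloading a model; in practice the inputs would be the vectors returned by `model.encode`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors standing in for real embeddings.
vec1 = [1.0, 2.0, 0.0]
vec2 = [2.0, 4.0, 0.0]  # same direction as vec1
vec3 = [0.0, 0.0, 3.0]  # orthogonal to vec1

print(cosine_similarity(vec1, vec2))  # close to 1: same direction
print(cosine_similarity(vec1, vec3))  # 0: no shared direction
```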
### Non-Medical Text Discrimination
**Example 2: Medical vs. Programming Terms**
```python
texts = ["diabetes type 2", "asyncio.run()"]
# Cosine Similarity Results:
# Our Model: 0.0804 (8.0%) - Correctly identifies low similarity
# NeuML Model: 0.1926 (19.3%) - Higher false similarity
# Better Discrimination: our similarity score is 58% lower for this unrelated pair
```
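The percentages in this card are relative changes with the NeuML model as the baseline. A minimal sketch of that arithmetic, using the Accuracy@1 row of the results table and the Example 2 scores (the helper name is hypothetical):

```python
def relative_change_percent(baseline, value):
    """Change from baseline to value, as a percentage of the baseline."""
    return (value - baseline) / baseline * 100.0

# Accuracy@1: 85.86% (NeuML) vs 91.28% (this model).
accuracy_at_1 = relative_change_percent(85.86, 91.28)
# Example 2 cosine similarity: 0.1926 (NeuML) vs 0.0804 (this model).
unrelated_pair = relative_change_percent(0.1926, 0.0804)

print(f"{accuracy_at_1:+.1f}%")   # +6.3%
print(f"{unrelated_pair:+.1f}%")  # -58.3%
```

A negative change is the desired direction for the unrelated pair, since a lower similarity score means the two texts are better separated.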
### Key Advantages
- **Enhanced Medical Understanding**: 12.7% higher cosine similarity for the related medical sentence pair in Example 1
- **Improved Discrimination**: 58% lower cosine similarity between the medical and non-medical terms in Example 2
- **Domain Specialization**: Fine-tuned on PubMed title-abstract pairs for medical text processing
## Training Details
- **Optimizer**: AdamW (learning rate: 3e-4, weight decay: 0.1)
- **Batch Size**: 72
- **Training Steps**: 7,000
- **Warmup Steps**: 700
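To make the warmup setting concrete, here is a small sketch of a learning-rate schedule built from the values above: linear warmup to 3e-4 over the first 700 steps, then linear decay to zero at step 7,000. The decay shape is an assumption, since this card states only the peak rate, warmup length, and total steps.

```python
def learning_rate_at(step, peak_lr=3e-4, warmup_steps=700, total_steps=7000):
    """Learning rate at a given step: linear warmup, then (assumed) linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to the peak rate during warmup.
        return peak_lr * step / warmup_steps
    # Assumed: decay linearly from the peak to 0 over the remaining steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)

for step in (0, 350, 700, 3850, 7000):
    print(step, learning_rate_at(step))
```

Under these assumptions the rate reaches half its peak at step 350, peaks at step 700, and returns to half the peak at step 3,850.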
## Citation
If you use this model, please cite the base model paper and acknowledge this fine-tuning work.