---
license: mit
base_model:
- thomas-sounack/BioClinical-ModernBERT-base
tags:
- sentence-transformers
- sentence-similarity
- medical
- clinical
- biomedical
- pubmed
- healthcare
- medical-ai
- clinical-nlp
- bioinformatics
- medical-literature
- clinical-text
---

# Clinical ModernBERT Embedding Model

A specialized medical embedding model fine-tuned from BioClinical ModernBERT using InfoNCE contrastive learning on PubMed title-abstract pairs.

## Model Details

- **Base Model**: thomas-sounack/BioClinical-ModernBERT-base
- **Training Method**: InfoNCE contrastive learning
- **Training Data**: PubMed title-abstract pairs
- **Max Sequence Length**: 2048 tokens

## Usage

```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("lokeshch19/ModernPubMedBERT")

# Encode medical texts
texts = [
    "Rheumatoid arthritis is an autoimmune disorder attacking joint linings.",
    "Inflammatory cytokines in RA lead to progressive cartilage and bone destruction.",
]

# Returns one embedding vector per input text
embeddings = model.encode(texts)
```

## Applications

- Medical document similarity analysis
- Clinical text retrieval systems
- Biomedical literature search
- Medical concept matching and classification

## Model Comparison

Compared to `NeuML/bioclinical-modernbert-base-embeddings`, this model retrieves medical concepts more accurately and separates medical from non-medical content more clearly in the evaluations below.
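The comparisons in this section rest on two small computations: the cosine similarity between two embeddings, and the relative difference between two scores. The sketch below shows both in plain Python. The helper names `cosine_similarity` and `relative_change` are illustrative and are not part of this model or of sentence-transformers; the toy vectors stand in for real embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_change(new, baseline):
    """Change of `new` relative to `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# Toy vectors: identical directions score 1.0, orthogonal directions 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0

# Accuracy@1 of 91.28% against a baseline of 85.86%.
print(round(relative_change(91.28, 85.86), 1))  # 6.3
```

With the `embeddings` array from the Usage snippet, `cosine_similarity(embeddings[0], embeddings[1])` would give the score for the two rheumatoid arthritis sentences.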
### Comprehensive Evaluation Results

| Metric | Our Model | NeuML Model | Relative Improvement |
|--------|-----------|-------------|----------------------|
| **Accuracy@1** | 91.28% | 85.86% | +6.3% |
| **Accuracy@3** | 98.46% | 95.66% | +2.9% |
| **Accuracy@5** | 99.24% | 97.14% | +2.2% |
| **Accuracy@10** | 99.64% | 98.29% | +1.4% |
| **NDCG@5** | 95.96% | 92.37% | +3.9% |
| **NDCG@10** | 96.10% | 92.75% | +3.6% |
| **MRR@10** | 94.89% | 90.90% | +4.4% |
| **MAP@100** | 94.91% | 90.96% | +4.3% |

*Evaluation performed using `InformationRetrievalEvaluator` from sentence-transformers on the `gamino/wiki_medical_terms` dataset. Improvement is measured relative to the NeuML score.*

### Medical Text Similarity

**Example 1: Related Medical Concepts**

```python
text1 = "Hypertension increases the risk of stroke and heart attack."
text2 = "High blood pressure damages arterial walls over time, leading to cardiovascular events."

# Cosine Similarity Results:
# Our Model:   0.5941 (59.4%)
# NeuML Model: 0.5267 (52.7%)
# Improvement: +12.7% (relative)
```

### Non-Medical Text Discrimination

**Example 2: Medical vs.
Programming Terms**

```python
texts = ["diabetes type 2", "asyncio.run()"]

# Cosine Similarity Results:
# Our Model:   0.0804 (8.0%)  - Correctly identifies low similarity
# NeuML Model: 0.1926 (19.3%) - Higher spurious similarity
# Better Discrimination: 58% lower similarity score for this unrelated pair
```

### Key Advantages

- **Enhanced Medical Understanding**: 12.7% higher similarity score for the related medical concepts in Example 1
- **Improved Discrimination**: 58% lower similarity score between the medical and non-medical terms in Example 2
- **Domain Specialization**: Fine-tuned specifically on PubMed literature for medical text processing

## Training Details

- **Optimizer**: AdamW (learning rate: 3e-4, weight decay: 0.1)
- **Batch Size**: 72
- **Training Steps**: 7,000
- **Warmup Steps**: 700

## Citation

If you use this model, please cite the base model paper and acknowledge this fine-tuning work.
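For readers unfamiliar with the InfoNCE objective named under Model Details, the following is a minimal plain-Python sketch of the loss with in-batch negatives. It is not the training code used for this model: the function name `info_nce_loss`, the use of cosine similarity, and the temperature value of 0.05 are all assumptions made for illustration.

```python
import math


def _normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    anchors[i] and positives[i] form a positive pair (for example, the
    embeddings of title i and abstract i). Every positives[j] with j != i
    acts as a negative for anchors[i]. The temperature is an assumed value.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching positive (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy batch: correctly aligned pairs give a much lower loss than swapped pairs.
titles = [[1.0, 0.0], [0.0, 1.0]]
abstracts = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce_loss(titles, abstracts) < info_nce_loss(titles, abstracts[::-1]))  # True
```

In this formulation a larger batch supplies more negatives per anchor, which is one reason batch size is usually reported alongside the optimizer settings for contrastive training.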