OGBert-2M-Sentence
A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.
Related models:
- mjbommar/ogbert-2m-base - Base MLM model for fill-mask tasks
Model Details
| Property | Value |
|---|---|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |
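These values can be sanity-checked directly from the published checkpoint. The sketch below assumes the standard `transformers` config attribute names for ModernBERT and a `transformers` version recent enough to include the architecture.

```python
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

print(config.hidden_size)          # expected: 128
print(config.num_hidden_layers)    # expected: 4
print(config.num_attention_heads)  # expected: 4
print(config.vocab_size)           # expected: 8192

# Total parameter count, expected to be roughly 2.1M
print(sum(p.numel() for p in model.parameters()))
```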
Training
- Pretraining: Masked Language Modeling on domain-specific glossary corpus
- Dataset: mjbommar/ogbert-v1-mlm - derived from OpenGloss, a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
- Key finding: L2 normalization of embeddings is critical for clustering/retrieval performance
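As a minimal illustration of why the normalization matters: once each embedding has unit length, cosine similarity reduces to a plain dot product, which is what the clustering and retrieval evaluations below rely on. The tensors here are placeholders, assuming PyTorch.

```python
import torch
import torch.nn.functional as F

pooled = torch.randn(4, 128)                  # stand-in for mean-pooled embeddings
normalized = F.normalize(pooled, p=2, dim=1)  # unit-length rows

# With unit-length rows, the dot product is exactly the cosine similarity.
cosine = normalized @ normalized.T
print(cosine.shape)  # torch.Size([4, 4])
```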
Performance
Semantic Textual Similarity (MTEB STS)
Spearman correlation between model similarity scores and human judgments on sentence pairs.
| Task | OGBert-2M | BERT-base | RoBERTa-base |
|---|---|---|---|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | 0.396 | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| Average | 0.451 | 0.521 | 0.528 |
OGBert-2M achieves 87% of BERT-base's average STS performance with 52x fewer parameters, and it outperforms both baselines on STS12.
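For reference, a minimal sketch of how such STS numbers are typically computed: cosine similarity for each sentence pair, then Spearman correlation against the human scores. The pairs and gold values below are placeholders, not the MTEB data or harness.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Placeholder pairs and gold similarity scores (MTEB supplies the real data).
pairs = [
    ('A man is playing a guitar.', 'A person plays an instrument.'),
    ('A dog runs in the park.', 'The stock market fell sharply.'),
    ('Two children are reading a book.', 'Kids read together.'),
]
gold = [4.2, 0.3, 4.0]

emb1 = model.encode([a for a, _ in pairs])
emb2 = model.encode([b for _, b in pairs])
predicted = util.cos_sim(emb1, emb2).diagonal().tolist()

correlation, _ = spearmanr(predicted, gold)
print(correlation)
```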
Document Clustering (ARI)
Evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.
| Model | Params | ARI |
|---|---|---|
| OGBert-2M-Sentence | 2.1M | 0.797 |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
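A sketch of this kind of evaluation, using standard k-means on the L2-normalized embeddings as a stand-in for spherical k-means and scikit-learn's adjusted Rand index; the documents and labels are placeholders, not the 80-document evaluation set.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

documents = ['finance doc one', 'finance doc two', 'medical doc one', 'medical doc two']
true_labels = [0, 0, 1, 1]  # placeholder category labels

# Embeddings are already L2 normalized, so Euclidean k-means on them
# behaves like spherical k-means.
embeddings = model.encode(documents)
predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

print(adjusted_rand_score(true_labels, predicted))
```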
Document Retrieval (MRR)
Mean Reciprocal Rank for same-category document retrieval.
| Model | Params | MRR | P@1 |
|---|---|---|---|
| OGBert-2M-Sentence | 2.1M | 0.973 | 0.963 |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
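A sketch of how MRR and P@1 can be computed for same-category retrieval: each document queries the others, and the rank of the first same-category neighbor is recorded. The corpus and labels below are illustrative placeholders.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

documents = ['finance doc one', 'finance doc two', 'medical doc one', 'medical doc two']
labels = np.array([0, 0, 1, 1])  # placeholder categories

embeddings = model.encode(documents)
similarity = util.cos_sim(embeddings, embeddings).numpy()
np.fill_diagonal(similarity, -np.inf)  # never retrieve a document for itself

reciprocal_ranks, precision_at_1 = [], []
for i in range(len(documents)):
    ranking = np.argsort(-similarity[i])   # most similar first
    relevant = labels[ranking] == labels[i]
    first_hit = int(np.argmax(relevant))   # 0-based rank of first same-category doc
    reciprocal_ranks.append(1.0 / (first_hit + 1))
    precision_at_1.append(float(relevant[0]))

print('MRR:', np.mean(reciprocal_ranks), 'P@1:', np.mean(precision_at_1))
```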
Summary vs Baselines
At roughly 1/50th the size of BERT-base, OGBert-2M-Sentence achieves:
- 87% of BERT-base STS (with STS12 win)
- 89% of BERT-base clustering (ARI)
- 98% of BERT-base retrieval (MRR)
Usage
Sentence-Transformers (Recommended)
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```
Example - Domain Similarity:
```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)
```
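The pairwise scores shown in the table below can be reproduced with `util.cos_sim` (or an equivalent dot product, since the embeddings are already L2 normalized):

```python
from sentence_transformers import util

similarities = util.cos_sim(embeddings, embeddings)
print(similarities[0][1])  # financial vs. financial
print(similarities[0][2])  # financial vs. medical
print(similarities[1][2])  # financial vs. medical
```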
| Pair | Similarity |
|---|---|
| Financial [0] ↔ Financial [1] | 0.915 |
| Medical [2] ↔ Financial [0] | 0.874 |
| Medical [2] ↔ Financial [1] | 0.808 |
The model correctly identifies higher similarity within the financial domain.
Direct Transformers Usage
```python
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)

# Mean pooling + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```
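Because the pooled embeddings are L2 normalized, cosine similarity between any two encoded texts is just a matrix product of the embeddings produced above:

```python
# Rows are unit length, so this matrix holds pairwise cosine similarities.
similarity = embeddings @ embeddings.T
```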
For Fill-Mask Tasks
Use mjbommar/ogbert-2m-base instead.
Citation
If you use this model, please cite the OpenGloss dataset:
```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```
License
Apache 2.0