OGBert-2M-Sentence

A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.

Related models:

  • mjbommar/ogbert-2m-base - base MLM checkpoint (use for fill-mask tasks)

Model Details

| Property | Value |
|---|---|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |
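These dimensions can be checked against the published config. The attribute names below follow the standard transformers config convention and are assumed to apply to this checkpoint; treat the snippet as a sketch rather than a guaranteed interface.

```python
from transformers import AutoConfig, AutoTokenizer

# Assumes standard transformers config attribute names for this ModernBERT checkpoint.
config = AutoConfig.from_pretrained('mjbommar/ogbert-2m-sentence')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')

print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)  # expect 128, 4, 4
print(tokenizer.vocab_size)  # expect 8192
```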

Training

  • Pretraining: Masked language modeling on a domain-specific glossary corpus
  • Dataset: mjbommar/ogbert-v1-mlm - derived from OpenGloss, a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
  • Key finding: L2 normalization of embeddings is critical for clustering/retrieval performance

Performance

Semantic Textual Similarity (MTEB STS)

Spearman correlation between model similarity scores and human judgments on sentence pairs.

| Task | OGBert-2M | BERT-base | RoBERTa-base |
|---|---|---|---|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | 0.396 | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| Average | 0.451 | 0.521 | 0.528 |

OGBert-2M achieves 87% of BERT-base's average STS performance with 52x fewer parameters, and it outperforms both baselines on STS12.
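The Spearman protocol can be reproduced outside the MTEB harness in a few lines. The sentence pairs and gold scores below are illustrative placeholders, not actual benchmark data.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Illustrative pairs with made-up human scores on a 0-5 scale (as in STSBenchmark).
pairs = [
    ('A man is playing a guitar.', 'A person plays a guitar.'),
    ('A man is playing a guitar.', 'Someone strums an instrument.'),
    ('A man is playing a guitar.', 'The stock market fell sharply.'),
]
gold = [4.8, 3.5, 0.2]

emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
sims = (emb_a * emb_b).sum(axis=1)  # dot product = cosine similarity (embeddings are L2 normalized)

rho, _ = spearmanr(sims, gold)      # correlation between model scores and human judgments
print(rho)
```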

Document Clustering (ARI)

Evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.

| Model | Params | ARI |
|---|---|---|
| OGBert-2M-Sentence | 2.1M | 0.797 |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
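This setup can be approximated as follows. Because the embeddings are L2 normalized, plain k-means on them is a common stand-in for spherical k-means; the documents and labels here are illustrative, not the 80-document evaluation set.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Illustrative documents and gold category labels.
docs = [
    'The court granted the motion to dismiss the complaint.',
    'The appellate ruling reversed the lower court decision.',
    'The patient presented with fever and a persistent cough.',
    'The trial enrolled 200 patients with type 2 diabetes.',
]
labels = [0, 0, 1, 1]

# Embeddings are L2 normalized, so Euclidean k-means approximates spherical k-means.
embeddings = model.encode(docs)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

print(adjusted_rand_score(labels, pred))
```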

Document Retrieval (MRR)

Mean Reciprocal Rank for same-category document retrieval.

| Model | Params | MRR | P@1 |
|---|---|---|---|
| OGBert-2M-Sentence | 2.1M | 0.973 | 0.963 |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
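MRR and P@1 for same-category retrieval can be computed as sketched below: each document queries the rest of the corpus, and a result is relevant when it shares the query's category. The corpus here is illustrative, not the evaluation data.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Illustrative corpus with category labels.
docs = [
    'Quarterly revenue exceeded analyst expectations.',
    'The company reported strong earnings growth this quarter.',
    'The patient was prescribed antibiotics for the infection.',
    'Antibiotic treatment resolved the bacterial infection.',
]
labels = np.array([0, 0, 1, 1])

emb = model.encode(docs)            # L2 normalized, so dot product = cosine similarity
sims = emb @ emb.T
np.fill_diagonal(sims, -np.inf)     # exclude the query itself

reciprocal_ranks, precision_at_1 = [], []
for i in range(len(docs)):
    ranking = np.argsort(-sims[i])              # best match first
    relevant = labels[ranking] == labels[i]
    first_hit = np.argmax(relevant) + 1         # rank of first same-category document
    reciprocal_ranks.append(1.0 / first_hit)
    precision_at_1.append(float(relevant[0]))

print('MRR:', np.mean(reciprocal_ranks), 'P@1:', np.mean(precision_at_1))
```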

Summary vs Baselines

At 1/50th the size, OGBert-2M-Sentence achieves:

  • 87% of BERT-base STS (with STS12 win)
  • 89% of BERT-base clustering (ARI)
  • 98% of BERT-base retrieval (MRR)

Usage

Sentence-Transformers (Recommended)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```

Example - Domain Similarity:

```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)
```

| Pair | Similarity |
|---|---|
| Financial [0] ↔ Financial [1] | 0.915 |
| Medical [2] ↔ Financial [0] | 0.874 |
| Medical [2] ↔ Financial [1] | 0.808 |

The model correctly identifies higher similarity within the financial domain.
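Continuing from the snippet above, the pairwise scores can be reproduced with a plain dot product, since the embeddings are L2 normalized:

```python
# Dot product equals cosine similarity because the embeddings are L2 normalized.
sims = embeddings @ embeddings.T
for i, j in [(0, 1), (2, 0), (2, 1)]:
    print(f'[{i}] vs [{j}]: {sims[i, j]:.3f}')
```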

Direct Transformers Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)

# Mean pooling + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```

For Fill-Mask Tasks

Use mjbommar/ogbert-2m-base instead.
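A minimal sketch with the standard transformers fill-mask pipeline (the example sentence is illustrative; the mask token is taken from the tokenizer rather than hard-coded):

```python
from transformers import pipeline

# Load the MLM checkpoint for masked-token prediction.
fill = pipeline('fill-mask', model='mjbommar/ogbert-2m-base')
text = f'A contract is a legally binding {fill.tokenizer.mask_token} between two parties.'
print(fill(text, top_k=3))
```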

Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

License

Apache 2.0
