OGBert-2M-Sentence

A tiny (2.1M parameter) ModernBERT-based sentence embedding model for glossary and domain-specific text.

Related models:

  • mjbommar/ogbert-2m-base - base MLM checkpoint (use for fill-mask tasks)

Model Details

| Property | Value |
|---|---|
| Architecture | ModernBERT + Mean Pooling + L2 Normalize |
| Parameters | 2.1M |
| Hidden size | 128 |
| Layers | 4 |
| Attention heads | 4 |
| Vocab size | 8,192 |
| Max sequence | 1,024 tokens |
| Embedding dim | 128 (L2 normalized) |
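These dimensions can be checked against the published config. The attribute names below follow the standard transformers config convention and are assumed to apply to this checkpoint; treat the snippet as a sketch rather than a guaranteed interface.

```python
from transformers import AutoConfig, AutoTokenizer

# Assumes standard transformers config attribute names for this ModernBERT checkpoint.
config = AutoConfig.from_pretrained('mjbommar/ogbert-2m-sentence')
tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')

print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)  # expect 128, 4, 4
print(tokenizer.vocab_size)  # expect 8192
```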

Training

  • Pretraining: Masked language modeling on a domain-specific glossary corpus
  • Dataset: mjbommar/ogbert-v1-mlm - derived from OpenGloss, a synthetic encyclopedic dictionary with 537K senses across 150K lexemes
  • Key finding: L2 normalization of embeddings is critical for clustering/retrieval performance

Performance

Semantic Textual Similarity (MTEB STS)

Spearman correlation between model similarity scores and human judgments on sentence pairs.

| Task | OGBert-2M | BERT-base | RoBERTa-base |
|---|---|---|---|
| STSBenchmark | 0.453 | 0.473 | 0.545 |
| BIOSSES | 0.489 | 0.547 | 0.582 |
| STS12 | 0.396 | 0.309 | 0.321 |
| STS13 | 0.460 | 0.599 | 0.563 |
| STS14 | 0.388 | 0.477 | 0.452 |
| STS15 | 0.500 | 0.603 | 0.613 |
| STS16 | 0.474 | 0.637 | 0.620 |
| Average | 0.451 | 0.521 | 0.528 |

OGBert-2M achieves 87% of BERT-base's average STS performance with 52x fewer parameters, and it outperforms both baselines on STS12.
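The Spearman protocol can be reproduced outside the MTEB harness in a few lines. The sentence pairs and gold scores below are illustrative placeholders, not actual benchmark data.

```python
from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Illustrative pairs with made-up human scores on a 0-5 scale (as in STSBenchmark).
pairs = [
    ('A man is playing a guitar.', 'A person plays a guitar.'),
    ('A man is playing a guitar.', 'Someone strums an instrument.'),
    ('A man is playing a guitar.', 'The stock market fell sharply.'),
]
gold = [4.8, 3.5, 0.2]

emb_a = model.encode([a for a, _ in pairs])
emb_b = model.encode([b for _, b in pairs])
sims = (emb_a * emb_b).sum(axis=1)  # dot product = cosine similarity (embeddings are L2 normalized)

rho, _ = spearmanr(sims, gold)      # correlation between model scores and human judgments
print(rho)
```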

Document Clustering (ARI)

Evaluated on 80 domain-specific documents across 10 categories using Spherical KMeans.

| Model | Params | ARI |
|---|---|---|
| OGBert-2M-Sentence | 2.1M | 0.797 |
| BERT-base | 110M | 0.896 |
| RoBERTa-base | 125M | 0.941 |
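This setup can be approximated as follows. Because the embeddings are L2 normalized, plain k-means on them is a common stand-in for spherical k-means; the documents and labels here are illustrative, not the 80-document evaluation set.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Illustrative documents and gold category labels.
docs = [
    'The court granted the motion to dismiss the complaint.',
    'The appellate ruling reversed the lower court decision.',
    'The patient presented with fever and a persistent cough.',
    'The trial enrolled 200 patients with type 2 diabetes.',
]
labels = [0, 0, 1, 1]

# Embeddings are L2 normalized, so Euclidean k-means approximates spherical k-means.
embeddings = model.encode(docs)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

print(adjusted_rand_score(labels, pred))
```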

Document Retrieval (MRR)

Mean Reciprocal Rank for same-category document retrieval.

| Model | Params | MRR | P@1 |
|---|---|---|---|
| OGBert-2M-Sentence | 2.1M | 0.973 | 0.963 |
| BERT-base | 110M | 0.994 | - |
| RoBERTa-base | 125M | 0.989 | - |
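MRR and P@1 for same-category retrieval can be computed as sketched below: each document queries the rest of the corpus, and a result is relevant when it shares the query's category. The corpus here is illustrative, not the evaluation data.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')

# Illustrative corpus with category labels.
docs = [
    'Quarterly revenue exceeded analyst expectations.',
    'The company reported strong earnings growth this quarter.',
    'The patient was prescribed antibiotics for the infection.',
    'Antibiotic treatment resolved the bacterial infection.',
]
labels = np.array([0, 0, 1, 1])

emb = model.encode(docs)            # L2 normalized, so dot product = cosine similarity
sims = emb @ emb.T
np.fill_diagonal(sims, -np.inf)     # exclude the query itself

reciprocal_ranks, precision_at_1 = [], []
for i in range(len(docs)):
    ranking = np.argsort(-sims[i])              # best match first
    relevant = labels[ranking] == labels[i]
    first_hit = np.argmax(relevant) + 1         # rank of first same-category document
    reciprocal_ranks.append(1.0 / first_hit)
    precision_at_1.append(float(relevant[0]))

print('MRR:', np.mean(reciprocal_ranks), 'P@1:', np.mean(precision_at_1))
```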

Summary vs Baselines

At 1/50th the size, OGBert-2M-Sentence achieves:

  • 87% of BERT-base STS (with STS12 win)
  • 89% of BERT-base clustering (ARI)
  • 98% of BERT-base retrieval (MRR)

Usage

Sentence-Transformers (Recommended)

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('mjbommar/ogbert-2m-sentence')
embeddings = model.encode(['your text here'])  # L2 normalized by default
```

Example - Domain Similarity:

```python
sentences = [
    'The financial audit revealed discrepancies in the quarterly report.',
    'An accounting review found errors in the fiscal statement.',
    'The patient was diagnosed with acute respiratory infection.',
]
embeddings = model.encode(sentences)
```

| Pair | Similarity |
|---|---|
| Financial [0] ↔ Financial [1] | 0.915 |
| Medical [2] ↔ Financial [0] | 0.874 |
| Medical [2] ↔ Financial [1] | 0.808 |

The model correctly identifies higher similarity within the financial domain.
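Continuing from the snippet above, the pairwise scores can be reproduced with a plain dot product, since the embeddings are L2 normalized:

```python
# Dot product equals cosine similarity because the embeddings are L2 normalized.
sims = embeddings @ embeddings.T
for i, j in [(0, 1), (2, 0), (2, 1)]:
    print(f'[{i}] vs [{j}]: {sims[i, j]:.3f}')
```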

Direct Transformers Usage

```python
from transformers import AutoModel, AutoTokenizer
import torch.nn.functional as F

tokenizer = AutoTokenizer.from_pretrained('mjbommar/ogbert-2m-sentence')
model = AutoModel.from_pretrained('mjbommar/ogbert-2m-sentence')

inputs = tokenizer('your text here', return_tensors='pt', padding=True, truncation=True)
outputs = model(**inputs)

# Mean pooling + L2 normalize (critical for performance)
mask = inputs['attention_mask'].unsqueeze(-1)
pooled = (outputs.last_hidden_state * mask).sum(1) / mask.sum(1)
embeddings = F.normalize(pooled, p=2, dim=1)
```

For Fill-Mask Tasks

Use mjbommar/ogbert-2m-base instead.
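A minimal sketch with the standard transformers fill-mask pipeline (the example sentence is illustrative; the mask token is taken from the tokenizer rather than hard-coded):

```python
from transformers import pipeline

# Load the MLM checkpoint for masked-token prediction.
fill = pipeline('fill-mask', model='mjbommar/ogbert-2m-base')
text = f'A contract is a legally binding {fill.tokenizer.mask_token} between two parties.'
print(fill(text, top_k=3))
```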

Citation

If you use this model, please cite the OpenGloss dataset:

```bibtex
@article{bommarito2025opengloss,
  title={OpenGloss: A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph},
  author={Bommarito II, Michael J.},
  journal={arXiv preprint arXiv:2511.18622},
  year={2025}
}
```

License

Apache 2.0
