metadata
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- dense
- turkish
- semantic-search
base_model:
- BAAI/bge-m3
- suayptalha/Sungur-9B
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
model-index:
- name: turkce-embedding-bge-m3
  results:
  - task:
      type: semantic-similarity
      name: Semantic Similarity (STS Validation)
    dataset:
      name: Turkish STS Validation Set
      type: sts-validation
    metrics:
    - type: pearson_cosine
      value: 0.9096
      name: Pearson Cosine
    - type: spearman_cosine
      value: 0.6839
      name: Spearman Cosine
datasets:
- nezahatkorkmaz/turkce-embedding-eslestirme-ucluler
- nezahatkorkmaz/turkce-embedding-sts-degerlendirme
- nezahatkorkmaz/turkce-embedding-eslestirme-ciftler
language:
- tr
🇹🇷 Turkish Embedding Model (bge-m3 Fine-tuned)
This model is a Turkish fine-tuned version of BAAI/bge-m3, optimized for Turkish semantic similarity, retrieval, and RAG (Retrieval-Augmented Generation) tasks.
It maps Turkish sentences and paragraphs into a 1024-dimensional dense vector space.
Model Overview
| Property | Value |
|---|---|
| Base Model | BAAI/bge-m3 |
| Architecture | XLM-RoBERTa + Pooling + Normalize |
| Embedding Dimension | 1024 |
| Max Sequence Length | 8192 |
| Similarity Function | Cosine |
| Loss Functions | MultipleNegativesRankingLoss + TripletLoss |
| Language | Turkish 🇹🇷 |
| Use Cases | Semantic Search, Text Similarity, RAG, Clustering |
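To make the two loss functions named in the table more concrete, here is a minimal pure-Python sketch of what each one computes on toy 2-dimensional vectors. It is an illustration only: the `margin` and `scale` values, the choice of Euclidean distance for the triplet loss, and the helper names are assumptions, and this card does not state the hyperparameters or how the two losses were combined during training.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(a, p) - d(a, n) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive. `margin` is an assumed value.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative)
    return max(gap + margin, 0.0)

def in_batch_ranking_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities.

    positives[i] is the correct match for anchors[i]; every other
    positive in the batch acts as a negative. `scale` is an assumed value.
    """
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cosine(a, p) for p in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch: each anchor is close to its own positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

easy = triplet_loss(anchors[0], positives[0], positives[1])  # negative is far away
hard = triplet_loss(anchors[0], positives[1], positives[0])  # roles swapped
matched = in_batch_ranking_loss(anchors, positives)
swapped = in_batch_ranking_loss(anchors, positives[::-1])

print(f"triplet easy={easy:.4f} hard={hard:.4f}")
print(f"ranking matched={matched:.4f} swapped={swapped:.4f}")
```

Both losses reward the same geometry, pulling matching pairs together and pushing non-matching ones apart. The triplet form needs an explicit negative for each anchor, while the ranking form reuses the other items in the batch as negatives.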
Evaluation Results
The model was evaluated on a Turkish Semantic Textual Similarity (STS) validation set.
Compared with the base multilingual BGE-M3 model, the fine-tuned model raises Pearson correlation from 0.8535 to 0.9096 (+0.0561), indicating closer linear agreement between cosine similarity scores and human judgments.
| Metric | Base (BAAI/bge-m3) | Fine-tuned | Δ (Change) |
|---|---|---|---|
| Spearman (ρ) | 0.6814 | 0.6839 | +0.0025 |
| Pearson (r) | 0.8535 | 0.9096 | +0.0561 |
On this Turkish STS set the gain is mainly in linear correlation: cosine scores track the human ratings more consistently, while rank correlation (Spearman) stays essentially the same as the base model.
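The two metrics in the table can move independently, which the following sketch illustrates with a plain-Python implementation of both. The gold labels and the two score lists are made-up values chosen for illustration, not data from the evaluation set.

```python
import math

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical gold labels and two sets of cosine scores.
# Both score lists order the pairs identically; only the spacing differs.
gold = [0.0, 0.25, 0.5, 0.75, 1.0]
compressed = [0.50, 0.52, 0.55, 0.70, 0.95]  # same order, non-linear spacing
calibrated = [0.05, 0.24, 0.52, 0.74, 0.98]  # same order, nearly linear

for name, scores in [("compressed", compressed), ("calibrated", calibrated)]:
    print(f"{name}: pearson={pearson(gold, scores):.4f} "
          f"spearman={spearman(gold, scores):.4f}")
```

Because both toy score lists rank the pairs in the same order, their Spearman values are identical. Only Pearson separates them, since it is sensitive to how linearly the scores follow the labels.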
Quick Example
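from sentence_transformers import SentenceTransformer, util
# Load the fine-tuned model from the Hugging Face Hub
model = SentenceTransformer("nezahatkorkmaz/turkce-embedding-bge-m3")
# Two Turkish paraphrases of "Ankara is the capital of Türkiye"
s1 = "Türkiye'nin başkenti Ankara'dır"
s2 = "Ankara Türkiye'nin başşehridir"
# Encode both sentences into L2-normalized 1024-dimensional vectors
emb1, emb2 = model.encode([s1, s2], normalize_embeddings=True)
# cos_sim returns a 1x1 tensor; .item() extracts the Python float
score = util.cos_sim(emb1, emb2).item()
print(f"Cosine similarity: {score:.4f}")
# Expected output ≈ 0.75–0.80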
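For the semantic search and RAG use cases, the same similarity score is typically used to rank a corpus against a query. The sketch below shows that ranking step in plain Python. It uses toy 2-dimensional unit vectors in place of real embeddings, and the `top_k` helper and the document labels are hypothetical names introduced for illustration. With the real model, the vectors would come from `model.encode(texts, normalize_embeddings=True)`.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_emb, corpus_embs, k=3):
    """Return (index, score) pairs for the k highest-scoring corpus entries.

    Assumes every vector is already L2-normalized.
    """
    scored = [(i, dot(query_emb, emb)) for i, emb in enumerate(corpus_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy unit vectors standing in for 1024-dimensional embeddings.
corpus = ["doc A", "doc B", "doc C"]
corpus_embs = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
query_emb = [0.8, 0.6]

hits = top_k(query_emb, corpus_embs, k=2)
for idx, score in hits:
    print(f"{score:.2f}  {corpus[idx]}")
```

A full sort is fine for a small corpus. For large collections a vector index is the usual choice, but the scoring idea stays the same.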