---
license: mit
datasets:
- newmindai/RAGTruth-TR
language:
- tr
- en
metrics:
- precision
- recall
- f1
- roc_auc
base_model:
- newmindai/TurkEmbed4STS
pipeline_tag: token-classification
---
# TurkEmbed4STS-HallucinationDetection
## Model Description
**TurkEmbed4STS-HallucinationDetection** is a Turkish hallucination detection model built on TurkEmbed4STS, a GTE-multilingual-base encoder previously fine-tuned for semantic textual similarity, and further fine-tuned here for token-level hallucination detection. It is part of the Turk-LettuceDetect suite for detecting hallucinations in Turkish Retrieval-Augmented Generation (RAG) applications.
## Model Details
- **Model Type:** Token-level binary classifier for hallucination detection
- **Base Architecture:** GTE-multilingual-base (TurkEmbed4STS)
- **Language:** Turkish (tr)
- **Training Dataset:** Machine-translated RAGTruth dataset (17,790 training instances)
- **Context Length:** Up to 8,192 tokens
- **Model Size:** ~305M parameters
## Intended Use
### Primary Use Cases
- Hallucination detection in Turkish RAG systems
- Token-level classification of supported vs. hallucinated content
- Stable performance across diverse Turkish text generation tasks
- Applications requiring consistent precision-recall balance
### Supported Tasks
- Question Answering (QA) hallucination detection
- Data-to-text generation verification
- Text summarization fact-checking
## Performance
### Overall Performance (F1-Score)
- **Whole Dataset:** 0.7666
- **Question Answering:** 0.7420
- **Data-to-text Generation:** 0.7797
- **Summarization:** 0.6123
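The card does not spell out how these F1 scores are computed. As a rough sketch, assuming the standard definition of precision, recall, and F1 for the hallucinated class (label 1), the calculation looks like this. The function name and the toy labels are illustrative only and are not taken from the evaluation code.

```python
def precision_recall_f1(gold, pred, positive=1):
    """Return (precision, recall, f1) for the positive (hallucinated) class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 0 = supported, 1 = hallucinated
gold = [0, 0, 1, 1, 0, 1]
pred = [0, 1, 1, 0, 0, 1]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The same function applies at either granularity: pass one label per token for token-level scores, or one label per answer for example-level scores.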
### Key Strengths
- Most consistent performance across task types among the Turk-LettuceDetect models
- Stable behavior avoiding extreme precision-recall imbalances
- Good semantic understanding from Turkish fine-tuning
## Training Details
### Training Data
- **Dataset:** Machine-translated RAGTruth benchmark
- **Size:** 17,790 training instances, 2,700 test instances
- **Tasks:** Question answering (MS MARCO), data-to-text (Yelp), summarization (CNN/Daily Mail)
- **Translation Model:** Google Gemma-3-27b-it
### Training Configuration
- **Epochs:** 6
- **Learning Rate:** 1e-5
- **Batch Size:** 4
- **Hardware:** NVIDIA A100 40GB GPU
- **Training Time:** ~2 hours
- **Optimization:** Cross-entropy loss with token masking
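"Cross-entropy loss with token masking" usually means that only some tokens (typically the answer tokens) contribute to the loss, while the rest are skipped. The plain-Python sketch below illustrates that idea under stated assumptions: the sentinel value `-100` mirrors PyTorch's default `ignore_index`, and which tokens are masked during actual training is not specified in this card.

```python
import math

# Assumed sentinel for masked tokens; mirrors PyTorch's default ignore_index.
IGNORE_INDEX = -100


def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over tokens whose label is not `ignore_index`.

    logits: per-token [score_supported, score_hallucinated]
    labels: per-token 0, 1, or `ignore_index`
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue  # masked token: contributes nothing to the loss
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        count += 1
    return total / count if count else 0.0


# Hypothetical sequence: first and last tokens are masked out.
logits = [[2.0, -1.0], [0.0, 0.0], [-1.0, 3.0], [0.5, 0.5]]
labels = [IGNORE_INDEX, 0, 1, IGNORE_INDEX]
loss = masked_cross_entropy(logits, labels)
print(f"masked loss: {loss:.4f}")
```

Averaging over unmasked tokens only keeps long contexts from diluting the signal that comes from the answer.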
### Pre-training Background
- Built on GTE-multilingual-base architecture
- Fine-tuned for NLI and STS tasks
- Optimized for Turkish language understanding
- Fine-tuned specifically for hallucination detection
## Technical Specifications
### Architecture Features
- **Base Model:** GTE-multilingual encoder
- **Specialization:** Turkish semantic textual similarity
- **Maximum Sequence Length:** 8,192 tokens
- **Classification Head:** Binary token-level classifier
- **Embedding Dimension:** 768 (hidden size of GTE-multilingual-base)
### Input Format
```
Input: [CONTEXT] [QUESTION] [GENERATED_ANSWER]
Output: Token-level binary labels (0=supported, 1=hallucinated)
```
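To make the output format concrete, the sketch below shows one way token-level labels could be merged into character spans of the answer. It assumes each token comes with a `(start, end)` character offset, as tokenizer offset mappings typically provide. The helper name `labels_to_spans` and the whitespace-style tokenization are hypothetical and are not part of the LettuceDetect API.

```python
def labels_to_spans(text, offsets, labels):
    """Merge runs of consecutive tokens labeled 1 into character spans."""
    spans = []
    current = None  # [start, end] of the run being built, if any
    for (start, end), label in zip(offsets, labels):
        if label == 1:
            if current is None:
                current = [start, end]
            else:
                current[1] = end  # extend the open run
        elif current is not None:
            spans.append(current)
            current = None
    if current is not None:
        spans.append(current)
    return [{"start": s, "end": e, "text": text[s:e]} for s, e in spans]


# Hypothetical tokenization of a short answer, with per-token labels.
answer = "Paris has 5 million people."
offsets = [(0, 5), (6, 9), (10, 11), (12, 19), (20, 26), (26, 27)]
labels = [0, 0, 1, 1, 0, 0]
spans = labels_to_spans(answer, offsets, labels)
print(spans)
```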
## Limitations and Biases
### Known Limitations
- Lower performance on summarization (F1 0.6123) than on question answering (0.7420) and data-to-text generation (0.7797)
- Performance dependent on translation quality of training data
- Smaller model size may limit complex reasoning capabilities
- Optimized for Turkish but built on multilingual foundation
### Potential Biases
- Translation artifacts from machine-translated training data
- Bias toward semantic similarity patterns from STS pre-training
- May favor shorter, more structured text over longer abstracts
## Usage
### Installation
```bash
pip install lettucedetect
```
### Basic Usage
```python
from lettucedetect.models.inference import HallucinationDetector

# Initialize the Turkish-specific hallucination detector
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/TurkEmbed4STS-HD",
)

# Turkish context passages, question, and answer.
# The context is passed as a list of passages.
context = [
    "İstanbul Türkiye'nin en büyük şehridir. "
    "Şehir 15 milyonluk nüfusla Avrupa'nın en kalabalık şehridir."
]
question = "İstanbul'un nüfusu nedir? İstanbul Avrupa'nın en kalabalık şehri midir?"
answer = "İstanbul'un nüfusu yaklaşık 16 milyondur ve Avrupa'nın en kalabalık şehridir."

# Get span-level predictions (start/end indices, confidence scores)
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans",
)
print("Detected hallucinations:", predictions)

# Illustrative output; offsets index into `answer`, and the confidence will vary:
# [{'start': 19, 'end': 40, 'confidence': 0.92, 'text': 'yaklaşık 16 milyondur'}]
```
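Once span predictions are available, a common next step is to surface them to a reviewer. The sketch below assumes the span dictionaries carry the `start`, `end`, and `confidence` keys shown in the example output. The helper `mark_hallucinations`, the `[[ ]]` markers, and the `0.5` threshold are hypothetical choices for illustration and are not part of the library.

```python
def mark_hallucinations(answer, spans, min_confidence=0.5,
                        open_tag="[[", close_tag="]]"):
    """Wrap each sufficiently confident span of `answer` in marker tags."""
    kept = sorted(
        (s for s in spans if s["confidence"] >= min_confidence),
        key=lambda s: s["start"],
    )
    pieces, cursor = [], 0
    for span in kept:
        if span["start"] < cursor:
            continue  # skip spans that overlap one already emitted
        pieces.append(answer[cursor:span["start"]])
        pieces.append(open_tag + answer[span["start"]:span["end"]] + close_tag)
        cursor = span["end"]
    pieces.append(answer[cursor:])
    return "".join(pieces)


# Hypothetical predictions in the "spans" output format.
answer = "The city has 16 million residents and was founded in 1900."
spans = [
    {"start": 13, "end": 23, "confidence": 0.92, "text": "16 million"},
    {"start": 53, "end": 57, "confidence": 0.30, "text": "1900"},
]
marked = mark_hallucinations(answer, spans)
print(marked)
```

Raising `min_confidence` trades recall for precision, which may be preferable when flagged spans are shown directly to end users.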
## Evaluation
### Benchmark Results
Evaluated on the machine-translated Turkish RAGTruth test set (2,700 instances). Across the three task types (question answering, data-to-text generation, and summarization), the model shows the most consistent behavior, with a stable precision-recall balance.
**Example-level Results**
<img
src="https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/RejTWu3JNjH8t0teV1Txf.png"
width="1000"
style="object-fit: contain; margin: auto; display: block;"
/>
**Token-level Results**
<img
src="https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/ECyrfN5Jv8fZSM0svxLXq.png"
width="500"
style="object-fit: contain; margin: auto; display: block;"
/>
### Comparative Analysis
- Most stable performance across diverse tasks
- Consistent precision-recall balance (unlike models with extreme values)
- Suitable for applications prioritizing reliability over peak performance
## Citation
```bibtex
@inproceedings{turklettucedetect2025,
title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
author={NewMind AI Team},
booktitle={9th International Artificial Intelligence and Data Processing Symposium (IDAP'25)},
year={2025},
address={Malatya, Turkey}
}
```
## Related Work
This model builds upon the TurkEmbed4STS model:
```bibtex
@article{turkembed4sts,
title={TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task},
author={Ezerceli, Ö. and Gümüşçekicci, G. and Erkoç, T. and Özenc, B.},
journal={preprint},
year={2024}
}
```
## Original LettuceDetect Framework
This model extends the LettuceDetect methodology:
```bibtex
@misc{Kovacs:2025,
title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
author={Ádám Kovács and Gábor Recski},
year={2025},
eprint={2502.17125},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.17125},
}
```
## License
This model is released under the MIT license to support research and development in Turkish NLP applications.
## Contact
For questions about this model or other Turkish hallucination detection models, please refer to the original paper or contact the authors.
---