---
license: mit
datasets:
- newmindai/RAGTruth-TR
language:
- tr
- en
metrics:
- precision
- recall
- f1
- roc_auc
base_model:
- newmindai/TurkEmbed4STS
pipeline_tag: token-classification
---
|
|
|
|
|
# TurkEmbed4STS-HallucinationDetection |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**TurkEmbed4STS-HallucinationDetection** is a Turkish hallucination detection model based on the GTE-multilingual architecture, optimized for semantic textual similarity and adapted for hallucination detection. This model is part of the Turk-LettuceDetect suite, specifically designed for detecting hallucinations in Turkish Retrieval-Augmented Generation (RAG) applications. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type:** Token-level binary classifier for hallucination detection |
|
|
- **Base Architecture:** GTE-multilingual-base (TurkEmbed4STS) |
|
|
- **Language:** Turkish (tr) |
|
|
- **Training Dataset:** Machine-translated RAGTruth dataset (17,790 training instances) |
|
|
- **Context Length:** Up to 8,192 tokens |
|
|
- **Model Size:** ~305M parameters |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
- Hallucination detection in Turkish RAG systems |
|
|
- Token-level classification of supported vs. hallucinated content |
|
|
- Stable performance across diverse Turkish text generation tasks |
|
|
- Applications requiring consistent precision-recall balance |
|
|
|
|
|
### Supported Tasks |
|
|
- Question Answering (QA) hallucination detection |
|
|
- Data-to-text generation verification |
|
|
- Text summarization fact-checking |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Overall Performance (F1-Score) |
|
|
- **Whole Dataset:** 0.7666 |
|
|
- **Question Answering:** 0.7420 |
|
|
- **Data-to-text Generation:** 0.7797 |
|
|
- **Summarization:** 0.6123 |
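
The F1 scores above are the harmonic mean of token-level precision and recall for the hallucinated class. As a quick reference, this is how such scores are computed from binary token predictions (a generic sketch, not the actual evaluation script):

```python
def token_prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive (hallucinated = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 tokens, two true positives, one false positive
print(token_prf([0, 1, 1, 0], [0, 1, 1, 1]))  # precision = 2/3, recall = 1.0, f1 = 0.8
```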
|
|
|
|
|
### Key Strengths |
|
|
- Most consistent performance across all task types |
|
|
- Stable behavior avoiding extreme precision-recall imbalances |
|
|
- Good semantic understanding from Turkish fine-tuning |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset:** Machine-translated RAGTruth benchmark |
|
|
- **Size:** 17,790 training instances, 2,700 test instances |
|
|
- **Tasks:** Question answering (MS MARCO), data-to-text (Yelp), summarization (CNN/Daily Mail) |
|
|
- **Translation Model:** Google Gemma-3-27b-it |
|
|
|
|
|
### Training Configuration |
|
|
- **Epochs:** 6 |
|
|
- **Learning Rate:** 1e-5 |
|
|
- **Batch Size:** 4 |
|
|
- **Hardware:** NVIDIA A100 40GB GPU |
|
|
- **Training Time:** ~2 hours |
|
|
- **Optimization:** Cross-entropy loss with token masking |
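
The masked cross-entropy objective above can be sketched as averaging per-token negative log-likelihoods while skipping masked positions (using label -100 for masked tokens, the usual Hugging Face convention; an illustrative sketch, not the actual training code):

```python
import math

def masked_token_ce(log_probs, labels, ignore_index=-100):
    """Mean cross-entropy over tokens whose label is not ignore_index.

    log_probs: per-token log-probabilities for each class, [[lp0, lp1], ...]
    labels:    0 (supported), 1 (hallucinated), or ignore_index (masked out)
    """
    losses = [-lp[y] for lp, y in zip(log_probs, labels) if y != ignore_index]
    return sum(losses) / len(losses)

# Two answer tokens contribute to the loss; the masked context token does not.
lp = [[math.log(0.9), math.log(0.1)],   # confident "supported"
      [math.log(0.2), math.log(0.8)],   # confident "hallucinated"
      [math.log(0.5), math.log(0.5)]]   # masked, ignored
print(masked_token_ce(lp, [0, 1, -100]))
```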
|
|
|
|
|
### Pre-training Background |
|
|
- Built on GTE-multilingual-base architecture |
|
|
- Fine-tuned for NLI and STS tasks |
|
|
- Optimized for Turkish language understanding |
|
|
- Fine-tuned specifically for hallucination detection |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Architecture Features |
|
|
- **Base Model:** GTE-multilingual encoder |
|
|
- **Specialization:** Turkish semantic textual similarity |
|
|
- **Maximum Sequence Length:** 8,192 tokens |
|
|
- **Classification Head:** Binary token-level classifier |
|
|
- **Embedding Dimension:** Based on GTE-multilingual architecture |
|
|
|
|
|
### Input Format |
|
|
```
Input: [CONTEXT] [QUESTION] [GENERATED_ANSWER]
Output: Token-level binary labels (0=supported, 1=hallucinated)
```
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
### Known Limitations |
|
|
- Lower performance on summarization tasks compared to structured tasks |
|
|
- Performance dependent on translation quality of training data |
|
|
- Smaller model size may limit complex reasoning capabilities |
|
|
- Optimized for Turkish but built on multilingual foundation |
|
|
|
|
|
### Potential Biases |
|
|
- Translation artifacts from machine-translated training data |
|
|
- Bias toward semantic similarity patterns from STS pre-training |
|
|
- May favor shorter, more structured text over longer abstractive summaries
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
```bash
pip install lettucedetect
```
|
|
|
|
|
### Basic Usage |
|
|
```python
from lettucedetect.models.inference import HallucinationDetector

# Initialize the Turkish-specific hallucination detector
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/TurkEmbed4STS-HD"
)

# Turkish context, question, and answer
context = "İstanbul Türkiye'nin en büyük şehridir. Şehir 15 milyonluk nüfusla Avrupa'nın en kalabalık şehridir."
question = "İstanbul'un nüfusu nedir? İstanbul Avrupa'nın en kalabalık şehri midir?"
answer = "İstanbul'un nüfusu yaklaşık 16 milyondur ve Avrupa'nın en kalabalık şehridir."

# Get span-level predictions (start/end indices, confidence scores)
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans"
)

print("Detected hallucinations:", predictions)
# Example output:
# [{'start': 34, 'end': 57, 'confidence': 0.92, 'text': 'yaklaşık 16 milyondur'}]
```
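
Conceptually, span outputs like the one above come from merging runs of consecutive tokens predicted as hallucinated back into character offsets in the answer. A simplified illustration of that post-processing idea (a hypothetical helper, not the library's internal API):

```python
def tokens_to_spans(answer, token_offsets, token_labels):
    """Merge consecutive hallucinated tokens (label 1) into character spans.

    token_offsets: (start, end) character offsets of each token in `answer`.
    """
    spans, current = [], None
    for (start, end), label in zip(token_offsets, token_labels):
        if label == 1:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append({"start": current[0], "end": current[1],
                          "text": answer[current[0]:current[1]]})
            current = None
    if current is not None:
        spans.append({"start": current[0], "end": current[1],
                      "text": answer[current[0]:current[1]]})
    return spans

answer = "Nüfus yaklaşık 16 milyondur."
offsets = [(0, 5), (6, 14), (15, 17), (18, 27)]
print(tokens_to_spans(answer, offsets, [0, 1, 1, 1]))
# One merged span covering "yaklaşık 16 milyondur"
```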
|
|
|
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Benchmark Results |
|
|
Evaluated on the machine-translated Turkish RAGTruth test set, the model shows the most consistent behavior of the suite across all three task types, with a stable precision-recall balance.
|
|
|
|
|
**Example-level Results** |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/RejTWu3JNjH8t0teV1Txf.png" width="1000" style="object-fit: contain; margin: auto; display: block;" />
|
|
**Token-level Results** |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/ECyrfN5Jv8fZSM0svxLXq.png" width="500" style="object-fit: contain; margin: auto; display: block;" />
|
|
|
|
|
### Comparative Analysis |
|
|
- Most stable performance across diverse tasks |
|
|
- Consistent precision-recall balance (unlike models with extreme values) |
|
|
- Suitable for applications prioritizing reliability over peak performance |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@inproceedings{turklettucedetect2025,
  title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
  author={NewMind AI Team},
  booktitle={9th International Artificial Intelligence and Data Processing Symposium (IDAP'25)},
  year={2025},
  address={Malatya, Turkey}
}
```
|
|
|
|
|
## Related Work |
|
|
|
|
|
This model builds upon the TurkEmbed4STS model: |
|
|
```bibtex
@article{turkembed4sts,
  title={TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task},
  author={Ezerceli, Ö. and Gümüşçekicci, G. and Erkoç, T. and Özenc, B.},
  journal={preprint},
  year={2024}
}
```
|
|
|
|
|
## Original LettuceDetect Framework |
|
|
|
|
|
This model extends the LettuceDetect methodology: |
|
|
```bibtex
@misc{Kovacs:2025,
  title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
  author={Ádám Kovács and Gábor Recski},
  year={2025},
  eprint={2502.17125},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.17125}
}
```
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT license to support research and development in Turkish NLP applications.
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions about this model or other Turkish hallucination detection models, please refer to the original paper or contact the authors. |
|
|
|
|
|
--- |