---
license: mit
datasets:
- newmindai/RAGTruth-TR
language:
- tr
- en
metrics:
- precision
- recall
- f1
- roc_auc
base_model:
- newmindai/TurkEmbed4STS
pipeline_tag: token-classification
---
|
|
|
|
|
# TurkEmbed4STS-HallucinationDetection |
|
|
|
|
|
## Model Description |
|
|
|
|
|
**TurkEmbed4STS-HallucinationDetection** is a Turkish hallucination detection model based on the GTE-multilingual architecture, optimized for semantic textual similarity and adapted for hallucination detection. This model is part of the Turk-LettuceDetect suite, specifically designed for detecting hallucinations in Turkish Retrieval-Augmented Generation (RAG) applications. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type:** Token-level binary classifier for hallucination detection |
|
|
- **Base Architecture:** GTE-multilingual-base (TurkEmbed4STS) |
|
|
- **Language:** Turkish (tr) |
|
|
- **Training Dataset:** Machine-translated RAGTruth dataset (17,790 training instances) |
|
|
- **Context Length:** Up to 8,192 tokens |
|
|
- **Model Size:** ~305M parameters |
|
|
|
|
|
## Intended Use |
|
|
|
|
|
### Primary Use Cases |
|
|
- Hallucination detection in Turkish RAG systems |
|
|
- Token-level classification of supported vs. hallucinated content |
|
|
- Stable performance across diverse Turkish text generation tasks |
|
|
- Applications requiring consistent precision-recall balance |
|
|
|
|
|
### Supported Tasks |
|
|
- Question Answering (QA) hallucination detection |
|
|
- Data-to-text generation verification |
|
|
- Text summarization fact-checking |
|
|
|
|
|
## Performance |
|
|
|
|
|
### Overall Performance (F1-Score) |
|
|
- **Whole Dataset:** 0.7666 |
|
|
- **Question Answering:** 0.7420 |
|
|
- **Data-to-text Generation:** 0.7797 |
|
|
- **Summarization:** 0.6123 |
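
The F1 scores above are the harmonic mean of token-level precision and recall for the hallucinated class. As a quick reference, this is how such scores are computed from binary token predictions (a generic sketch, not the actual evaluation script):

```python
def token_prf(y_true, y_pred):
    """Precision, recall, and F1 for the positive (hallucinated = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: 4 tokens, two true positives, one false positive
print(token_prf([0, 1, 1, 0], [0, 1, 1, 1]))  # precision = 2/3, recall = 1.0, f1 = 0.8
```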
|
|
|
|
|
### Key Strengths |
|
|
- Most consistent performance across all task types |
|
|
- Stable behavior avoiding extreme precision-recall imbalances |
|
|
- Good semantic understanding from Turkish fine-tuning |
|
|
|
|
|
## Training Details |
|
|
|
|
|
### Training Data |
|
|
- **Dataset:** Machine-translated RAGTruth benchmark |
|
|
- **Size:** 17,790 training instances, 2,700 test instances |
|
|
- **Tasks:** Question answering (MS MARCO), data-to-text (Yelp), summarization (CNN/Daily Mail) |
|
|
- **Translation Model:** Google Gemma-3-27b-it |
|
|
|
|
|
### Training Configuration |
|
|
- **Epochs:** 6 |
|
|
- **Learning Rate:** 1e-5 |
|
|
- **Batch Size:** 4 |
|
|
- **Hardware:** NVIDIA A100 40GB GPU |
|
|
- **Training Time:** ~2 hours |
|
|
- **Optimization:** Cross-entropy loss with token masking |
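
The masked cross-entropy objective above can be sketched as averaging per-token negative log-likelihoods while skipping masked positions (using label -100 for masked tokens, the usual Hugging Face convention; an illustrative sketch, not the actual training code):

```python
import math

def masked_token_ce(log_probs, labels, ignore_index=-100):
    """Mean cross-entropy over tokens whose label is not ignore_index.

    log_probs: per-token log-probabilities for each class, [[lp0, lp1], ...]
    labels:    0 (supported), 1 (hallucinated), or ignore_index (masked out)
    """
    losses = [-lp[y] for lp, y in zip(log_probs, labels) if y != ignore_index]
    return sum(losses) / len(losses)

# Two answer tokens contribute to the loss; the masked context token does not.
lp = [[math.log(0.9), math.log(0.1)],   # confident "supported"
      [math.log(0.2), math.log(0.8)],   # confident "hallucinated"
      [math.log(0.5), math.log(0.5)]]   # masked, ignored
print(masked_token_ce(lp, [0, 1, -100]))
```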
|
|
|
|
|
### Pre-training Background |
|
|
- Built on GTE-multilingual-base architecture |
|
|
- Fine-tuned for NLI and STS tasks |
|
|
- Optimized for Turkish language understanding |
|
|
- Fine-tuned specifically for hallucination detection |
|
|
|
|
|
## Technical Specifications |
|
|
|
|
|
### Architecture Features |
|
|
- **Base Model:** GTE-multilingual encoder |
|
|
- **Specialization:** Turkish semantic textual similarity |
|
|
- **Maximum Sequence Length:** 8,192 tokens |
|
|
- **Classification Head:** Binary token-level classifier |
|
|
- **Embedding Dimension:** Based on GTE-multilingual architecture |
|
|
|
|
|
### Input Format |
|
|
```
Input: [CONTEXT] [QUESTION] [GENERATED_ANSWER]
Output: Token-level binary labels (0=supported, 1=hallucinated)
```
|
|
|
|
|
## Limitations and Biases |
|
|
|
|
|
### Known Limitations |
|
|
- Lower performance on summarization tasks compared to structured tasks |
|
|
- Performance dependent on translation quality of training data |
|
|
- Smaller model size may limit complex reasoning capabilities |
|
|
- Optimized for Turkish but built on multilingual foundation |
|
|
|
|
|
### Potential Biases |
|
|
- Translation artifacts from machine-translated training data |
|
|
- Bias toward semantic similarity patterns from STS pre-training |
|
|
- May favor shorter, more structured text over longer abstractive summaries
|
|
|
|
|
## Usage |
|
|
|
|
|
### Installation |
|
|
```bash
pip install lettucedetect
```
|
|
|
|
|
### Basic Usage |
|
|
```python
from lettucedetect.models.inference import HallucinationDetector

# Initialize the Turkish-specific hallucination detector
detector = HallucinationDetector(
    method="transformer",
    model_path="newmindai/TurkEmbed4STS-HD"
)

# Turkish context, question, and answer
context = "İstanbul Türkiye'nin en büyük şehridir. Şehir 15 milyonluk nüfusla Avrupa'nın en kalabalık şehridir."
question = "İstanbul'un nüfusu nedir? İstanbul Avrupa'nın en kalabalık şehri midir?"
answer = "İstanbul'un nüfusu yaklaşık 16 milyondur ve Avrupa'nın en kalabalık şehridir."

# Get span-level predictions (start/end indices, confidence scores)
predictions = detector.predict(
    context=context,
    question=question,
    answer=answer,
    output_format="spans"
)

print("Detected hallucinations:", predictions)
# Example output:
# [{'start': 34, 'end': 57, 'confidence': 0.92, 'text': 'yaklaşık 16 milyondur'}]
```
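
Conceptually, span outputs like the one above come from merging runs of consecutive tokens predicted as hallucinated back into character offsets in the answer. A simplified illustration of that post-processing idea (a hypothetical helper, not the library's internal API):

```python
def tokens_to_spans(answer, token_offsets, token_labels):
    """Merge consecutive hallucinated tokens (label 1) into character spans.

    token_offsets: (start, end) character offsets of each token in `answer`.
    """
    spans, current = [], None
    for (start, end), label in zip(token_offsets, token_labels):
        if label == 1:
            if current is None:
                current = [start, end]   # open a new span
            else:
                current[1] = end         # extend the open span
        elif current is not None:
            spans.append({"start": current[0], "end": current[1],
                          "text": answer[current[0]:current[1]]})
            current = None
    if current is not None:
        spans.append({"start": current[0], "end": current[1],
                      "text": answer[current[0]:current[1]]})
    return spans

answer = "Nüfus yaklaşık 16 milyondur."
offsets = [(0, 5), (6, 14), (15, 17), (18, 27)]
print(tokens_to_spans(answer, offsets, [0, 1, 1, 1]))
# One merged span covering "yaklaşık 16 milyondur"
```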
|
|
|
|
|
|
|
|
## Evaluation |
|
|
|
|
|
### Benchmark Results |
|
|
Evaluated on the machine-translated Turkish RAGTruth test set, the model shows the most consistent behavior of the suite across all three task types, with a stable precision-recall balance.
|
|
|
|
|
**Example-level Results** |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/RejTWu3JNjH8t0teV1Txf.png" width="1000" style="object-fit: contain; margin: auto; display: block;" />
|
|
**Token-level Results** |
|
|
|
|
|
<img src="https://cdn-uploads.huggingface.co/production/uploads/683d4880e639f8d647355997/ECyrfN5Jv8fZSM0svxLXq.png" width="500" style="object-fit: contain; margin: auto; display: block;" />
|
|
|
|
|
### Comparative Analysis |
|
|
- Most stable performance across diverse tasks |
|
|
- Consistent precision-recall balance (unlike models with extreme values) |
|
|
- Suitable for applications prioritizing reliability over peak performance |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@inproceedings{turklettucedetect2025,
  title={Turk-LettuceDetect: A Hallucination Detection Models for Turkish RAG Applications},
  author={NewMind AI Team},
  booktitle={9th International Artificial Intelligence and Data Processing Symposium (IDAP'25)},
  year={2025},
  address={Malatya, Turkey}
}
```
|
|
|
|
|
## Related Work |
|
|
|
|
|
This model builds upon the TurkEmbed4STS model: |
|
|
```bibtex
@article{turkembed4sts,
  title={TurkEmbed4Retrieval: Turkish Embedding Model for Retrieval Task},
  author={Ezerceli, Ö. and Gümüşçekicci, G. and Erkoç, T. and Özenc, B.},
  journal={preprint},
  year={2024}
}
```
|
|
|
|
|
## Original LettuceDetect Framework |
|
|
|
|
|
This model extends the LettuceDetect methodology: |
|
|
```bibtex
@misc{Kovacs:2025,
  title={LettuceDetect: A Hallucination Detection Framework for RAG Applications},
  author={Ádám Kovács and Gábor Recski},
  year={2025},
  eprint={2502.17125},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.17125}
}
```
|
|
|
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the MIT license to support research and development in Turkish NLP applications.
|
|
|
|
|
## Contact |
|
|
|
|
|
For questions about this model or other Turkish hallucination detection models, please refer to the original paper or contact the authors. |
|
|
|
|
|
--- |