Models That Know How Evaluations Are Designed Score Safer
Paper • arXiv:2605.28591
2026 arXiv preprint. Models fine-tuned on documents describing typical evaluation traits behave more safely, with higher refusal rates and low