ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists Paper • 2506.01241 • Published Jun 2
FactBench: A Dynamic Benchmark for In-the-Wild Language Model Factuality Evaluation Paper • 2410.22257 • Published Oct 29, 2024