GIFT Eval
GIFT-Eval: A Benchmark for General Time Series Forecasting
LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering
MMPersuade: A Dataset and Evaluation Framework for Multimodal Persuasion
GIFT-Eval: A Benchmark for General Time Series Forecasting
A realistic benchmark with real CRM tasks for LLM agents.
View and submit LLM benchmark evaluations
Filter and view LLM benchmark data
Explore efficient reasoning techniques with large language models
Generate captions and chat about images