stereoplegic's Collections: Benchmark
• KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval (arXiv:2310.15511)
• HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models (arXiv:2310.14566)
• SmartPlay: A Benchmark for LLMs as Intelligent Agents (arXiv:2310.01557)
• FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation (arXiv:2310.03214)
• TiC-CLIP: Continual Training of CLIP Models (arXiv:2310.16226)
• CrossCodeEval: A Diverse and Multilingual Benchmark for Cross-File Code Completion (arXiv:2310.11248)
• SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (arXiv:2310.06770)
• LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (arXiv:2308.14508)
• JudgeLM: Fine-tuned Large Language Models are Scalable Judges (arXiv:2310.17631)
• L-Eval: Instituting Standardized Evaluation for Long Context Language Models (arXiv:2307.11088)
• Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation (arXiv:2308.01240)
• ALERT: Adapting Language Models to Reasoning Tasks (arXiv:2212.08286)
• AGIBench: A Multi-granularity, Multimodal, Human-referenced, Auto-scoring Benchmark for Large Language Models (arXiv:2309.06495)
• RAGAS: Automated Evaluation of Retrieval Augmented Generation (arXiv:2309.15217)
• EvalCrafter: Benchmarking and Evaluating Large Video Generation Models (arXiv:2310.11440)
• Benchmarking Large Language Models in Retrieval-Augmented Generation (arXiv:2309.01431)
• PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization (arXiv:2306.05087)
• INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (arXiv:2306.04757)
• PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts (arXiv:2306.04528)
• PIXIU: A Large Language Model, Instruction Data and Evaluation Benchmark for Finance (arXiv:2306.05443)
• ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation (arXiv:2308.01861)
• Out of the BLEU: How Should We Assess Quality of the Code Generation Models? (arXiv:2208.03133)
• CodeApex: A Bilingual Programming Evaluation Benchmark for Large Language Models (arXiv:2309.01940)
• COPEN: Probing Conceptual Knowledge in Pre-trained Language Models (arXiv:2211.04079)
• Benchmarking Language Models for Code Syntax Understanding (arXiv:2210.14473)
• BigIssue: A Realistic Bug Localization Benchmark (arXiv:2207.10739)
• CCT-Code: Cross-Consistency Training for Multilingual Clone Detection and Code Search (arXiv:2305.11626)
• AutoMLBench: A Comprehensive Experimental Evaluation of Automated Machine Learning Frameworks (arXiv:2204.08358)
• Continual Evaluation for Lifelong Learning: Identifying the Stability Gap (arXiv:2205.13452)
• MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv:2311.07463)
• RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models (arXiv:2308.10633)
• Fake Alignment: Are LLMs Really Aligned Well? (arXiv:2311.05915)
• ToolTalk: Evaluating Tool-Usage in a Conversational Setting (arXiv:2311.10775)
• MetaTool Benchmark for Large Language Models: Deciding Whether to Use Tools and Which to Use (arXiv:2310.03128)
• RETA-LLM: A Retrieval-Augmented Large Language Model Toolkit (arXiv:2306.05212)
• GPQA: A Graduate-Level Google-Proof Q&A Benchmark (arXiv:2311.12022)
• GAIA: A Benchmark for General AI Assistants (arXiv:2311.12983)
• ML-Bench: Large Language Models Leverage Open-source Libraries for Machine Learning Tasks (arXiv:2311.09835)
• CRUXEval: A Benchmark for Code Reasoning, Understanding and Execution (arXiv:2401.03065)
• CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model (arXiv:2310.06266)
• ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models (arXiv:2401.13311)
• Text2KGBench: A Benchmark for Ontology-Driven Knowledge Graph Generation from Text (arXiv:2308.02357)
• Have LLMs Advanced Enough? A Challenging Problem Solving Benchmark for Large Language Models (arXiv:2305.15074)
• Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming (arXiv:2402.14261)
• The FinBen: An Holistic Financial Benchmark for Large Language Models (arXiv:2402.12659)
• Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once? (arXiv:2402.11597)
• ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios (arXiv:2401.00741)
• ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows (arXiv:2505.19897)
• MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers (arXiv:2508.14704)
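Every entry in this collection resolves to an arXiv paper, so the IDs above can be turned into full metadata programmatically. A minimal sketch follows, assuming the public arXiv export API at export.arxiv.org (not part of this collection page) and using only the Python standard library; the IDs in ARXIV_IDS are sample values copied from the list above.

```python
# Fetch title and abstract metadata for a few arXiv IDs from this collection.
# Assumption: the public arXiv export API (http://export.arxiv.org/api/query)
# is used here purely as an illustration; it is not referenced by the collection itself.
import urllib.request
import xml.etree.ElementTree as ET

ARXIV_IDS = ["2310.15511", "2310.06770", "2311.12022"]  # KITAB, SWE-bench, GPQA

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv feed


def fetch_metadata(ids):
    """Yield (title, abstract) pairs for the given arXiv IDs."""
    url = "http://export.arxiv.org/api/query?id_list=" + ",".join(ids)
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    for entry in feed.findall(ATOM + "entry"):
        title = entry.findtext(ATOM + "title", default="").strip()
        abstract = entry.findtext(ATOM + "summary", default="").strip()
        yield title, abstract


if __name__ == "__main__":
    for title, abstract in fetch_metadata(ARXIV_IDS):
        print(title)
        print(abstract[:200], "...\n")
```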