Evaluate your model with Inspect-AI

Pick the right benchmarks with our benchmark finder: search by language, task type, dataset name, or keywords.

Not all tasks are compatible with inspect-ai’s API yet; we are working on converting all of them!

Once you’ve chosen a benchmark, run it with lighteval eval. Below are examples for common setups.
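All examples share the same basic shape, shown here schematically (MODEL and TASKS are placeholders, not literal arguments):

lighteval eval MODEL TASKS [options]

MODEL is a provider-prefixed model id such as hf-inference-providers/openai/gpt-oss-20b, and TASKS is one or more comma-separated task specs.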

Examples

  1. Evaluate a model via Hugging Face Inference Providers.
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
  2. Run multiple evals at the same time.
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond,aime25
  3. Compare providers for the same model.
lighteval eval \
    hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
    hf-inference-providers/openai/gpt-oss-20b:together \
    hf-inference-providers/openai/gpt-oss-20b:nebius \
    gpqa:diamond

You can also compare every provider serving one model in one line:

lighteval eval \
    hf-inference-providers/openai/gpt-oss-20b:all \
    "lighteval|gpqa:diamond|0"
  4. Evaluate a vLLM or SGLang model.
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct gpqa:diamond
  5. See the impact of few-shot examples on your model (the task-spec syntax used here is spelled out after this list).
lighteval eval hf-inference-providers/openai/gpt-oss-20b "gsm8k|0,gsm8k|5"
  6. Optimize custom server connections.
lighteval eval hf-inference-providers/openai/gpt-oss-20b gsm8k \
    --max-connections 50 \
    --timeout 30 \
    --retry-on-error 1 \
    --max-retries 1 \
    --max-samples 10
  7. Use multiple epochs for more reliable results (the reducer combines per-epoch scores; pass_at_4 estimates the chance that at least one of 4 attempts is correct).
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --epochs 16 --epochs-reducer "pass_at_4"
  8. Push to the Hub to share results.
lighteval eval hf-inference-providers/openai/gpt-oss-20b hle \
    --bundle-dir gpt-oss-bundle \
    --repo-id OpenEvals/evals \
    --max-samples 100

Resulting Space:

  9. Change model behaviour.

You can use any argument defined in inspect-ai’s API.

lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --temperature 0.1
  10. Use --model-args to pass any inference-provider-specific argument.
lighteval eval google/gemini-2.5-pro aime25 --model-args location=us-east5
lighteval eval openai/gpt-4o gpqa:diamond --model-args service_tier=flex,client_timeout=1200
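
The task specs above follow the pattern suite|task|num_fewshot; short forms such as gpqa:diamond or gsm8k|5 fill in defaults for the omitted fields. As a sketch of the fully qualified form (assuming gsm8k is available under the lighteval suite, as gpqa:diamond is in the provider-comparison example), the few-shot comparison could also be written as:

lighteval eval hf-inference-providers/openai/gpt-oss-20b \
    "lighteval|gsm8k|0,lighteval|gsm8k|5"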

LightEval prints a per-model results table:

Completed all tasks in 'lighteval-logs' successfully

|                 Model                 |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01|        0.01|

results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results