# Evaluate your model with Inspect-AI
Pick the right benchmarks with our benchmark finder: search by language, task type, dataset name, or keywords.
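If you prefer the terminal, you can also list tasks from the CLI and filter them with standard tools. A minimal sketch, assuming your lighteval version ships the `tasks list` subcommand (check `lighteval --help` if it does not):

```bash
# List every registered task, then filter for a keyword (here "gpqa").
lighteval tasks list | grep -i gpqa
```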
Not all tasks are compatible with inspect-ai’s API yet; we are working on converting all of them!
Once you’ve chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.
## Examples
- Evaluate a model via Hugging Face Inference Providers.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
```

- Run multiple evals at the same time.

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond,aime25
```

- Compare providers for the same model.

```bash
lighteval eval \
hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
hf-inference-providers/openai/gpt-oss-20b:together \
hf-inference-providers/openai/gpt-oss-20b:nebius \
  gpqa:diamond
```

You can also compare all providers serving one model in a single command:

```bash
lighteval eval \
  hf-inference-providers/openai/gpt-oss-20b:all \
  "lighteval|gpqa:diamond|0"
```
- Evaluate a vLLM or SGLang model.

```bash
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct gpqa:diamond
```

- See the impact of few-shot on your model.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "gsm8k|0,gsm8k|5"
```

- Optimize custom server connections.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b gsm8k \
--max-connections 50 \
--timeout 30 \
--retry-on-error 1 \
--max-retries 1 \
  --max-samples 10
```

- Use multiple epochs for more reliable results.
```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --epochs 16 --epochs-reducer "pass_at_4"
```
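Here `--epochs 16` runs every sample 16 times and `--epochs-reducer "pass_at_4"` collapses those runs into a single score. Assuming the reducer implements the standard unbiased pass@k estimator (the convention popularized by the HumanEval paper; verify against your inspect-ai version), the score for a sample with $c$ correct completions out of $n = 16$ runs is:

$$
\text{pass@}k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}, \qquad k = 4
$$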
- Push to the Hub to share results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b hle \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals \
  --max-samples 100
```

The resulting Space, pushed to the repo above, lets you browse the results interactively.
- Change model behaviour.
You can use any argument defined in inspect-ai’s API.
```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --temperature 0.1
```

- Use `--model-args` to pass any inference-provider-specific argument.

```bash
lighteval eval google/gemini-2.5-pro aime25 --model-args location=us-east5
lighteval eval openai/gpt-4o gpqa:diamond --model-args service_tier=flex,client_timeout=1200
```

LightEval prints a per-model results table:
```
Completed all tasks in 'lighteval-logs' successfully
| Model                                 |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01|        0.01|
results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results
```