#evaluation #multi-metric #llm
The EleutherAI Language Model Evaluation Harness is a unified framework for testing generative language models on a large number of different evaluation tasks. It covers over 60 standard academic benchmarks for LLMs, with hundreds of sub-tasks and variants.
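A minimal sketch of running a single task through the harness's Python API is shown below. It assumes a recent lm-evaluation-harness release that exposes `simple_evaluate` at the package level; the model id and exact argument names are illustrative and may differ across versions.

```python
# Minimal sketch, assuming lm-evaluation-harness v0.4+ where `simple_evaluate`
# is exposed at the package level; exact arguments may differ across versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-1.4b",  # any HF model id (example choice)
    tasks=["hellaswag"],                             # one of the 60+ available tasks
    num_fewshot=10,                                  # number of in-context examples
)
print(results["results"]["hellaswag"])               # per-task metrics (e.g. acc, acc_norm)
```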
The harness is used by 🤗 Hugging Face's popular Open LLM Leaderboard to evaluate models on six key benchmarks (at the time of writing):
- AI2 Reasoning Challenge (25-shot) - a set of grade-school science questions.
- HellaSwag (10-shot) - a test of commonsense inference, which is easy for humans (~95%) but challenging for SOTA models.
- MMLU (5-shot) - a test to measure a text model's multitask accuracy. The test covers 57 tasks including elementary mathematics, US history, computer science, law, and more.
- TruthfulQA (0-shot) - a test to measure a model's propensity to reproduce falsehoods commonly found online. Note: TruthfulQA is technically a 6-shot task in the Harness because each example is prepended with 6 Q/A pairs, even in the 0-shot setting.
- Winogrande (5-shot) - an adversarial and difficult Winograd benchmark at scale, for commonsense reasoning.
- GSM8k (5-shot) - diverse grade school math word problems to measure a model's ability to solve multi-step mathematical reasoning problems.
For all of these evaluations, a higher score is better.
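As a sketch (not an exact reproduction of the leaderboard's pipeline), the six settings above can be looped over with the harness. The task identifiers below are assumptions that may differ between harness versions, which have renamed the MMLU and TruthfulQA tasks over time.

```python
# Hedged sketch: loop over the six Open LLM Leaderboard settings.
# Task names ("mmlu", "truthfulqa_mc2", ...) are assumptions and vary
# across harness versions; check `lm_eval --tasks list` for your install.
import lm_eval

LEADERBOARD_TASKS = [
    ("arc_challenge", 25),
    ("hellaswag", 10),
    ("mmlu", 5),
    ("truthfulqa_mc2", 0),
    ("winogrande", 5),
    ("gsm8k", 5),
]

for task, shots in LEADERBOARD_TASKS:
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=EleutherAI/pythia-1.4b",  # example model id
        tasks=[task],
        num_fewshot=shots,
    )
    print(task, out["results"][task])
```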
These benchmarks were chosen by Hugging Face because they test reasoning and general knowledge across a wide range of fields, in 0-shot and few-shot settings. As noted above, the harness itself covers over 60 standard academic benchmarks beyond these six.