LM Evaluation Harness benchmark

#evaluation #multi-metric #llm

The EleutherAI Language Model Evaluation Harness is a unified framework for testing generative language models on a large number of different evaluation tasks. It provides over 60 standard academic benchmarks for LLMs, with hundreds of sub-tasks and variants.

It is used by 🤗 Hugging Face's popular Open LLM Leaderboard to evaluate models on 6 (at the moment) key benchmarks:

- ARC (AI2 Reasoning Challenge)
- HellaSwag
- MMLU
- TruthfulQA
- Winogrande
- GSM8K

For all these evaluations, a higher score is better.

These benchmarks were chosen by HF because they test reasoning and general knowledge across a wide range of fields, in 0-shot and few-shot settings. But as stated above, the framework itself can test on over 60 standard academic benchmarks.
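As an illustration, here is a minimal sketch of running one of these benchmarks through the harness's Python API (assuming a recent `lm-eval` release, 0.4 or later; the model checkpoint, task, and shot count are placeholders to swap for your own):

```python
# Minimal sketch: evaluate a small Hugging Face model on HellaSwag
# with the LM Evaluation Harness (pip install lm-eval).
# The model, task, and shot count below are illustrative placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face transformers backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF causal LM checkpoint
    tasks=["hellaswag"],                             # one of the 60+ available tasks
    num_fewshot=0,                                   # 0-shot here; few-shot is also supported
    batch_size=8,
)

# Per-task metrics (e.g. accuracy); higher is better for these benchmarks.
print(results["results"]["hellaswag"])
```

Recent versions also expose an equivalent `lm_eval` command-line entry point for running the same evaluations without writing Python.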

More details here: