#evaluation #multi-metric #llm
BIG-bench (Beyond the Imitation Game Benchmark) is a comprehensive evaluation framework designed to assess the capabilities and limitations of language models across a wide range of tasks. It currently includes 204 tasks focusing on diverse topics such as linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, and software development.
A human expert rater team also performs all tasks to provide a strong baseline for comparison.
The benchmark has revealed that model performance and calibration improve with scale, but remain poor in absolute terms and when compared to human-rater performance.
The benchmark also examines how model scale affects social bias, finding that bias typically increases with scale in settings with ambiguous context, but can be reduced through prompting.
BIG-bench Lite (BBL) is a small subset of 24 diverse JSON tasks from BIG-bench. It is designed to provide a canonical measure of model performance while being far cheaper to evaluate than the full set of tasks in BIG-bench.
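Each BIG-bench JSON task is a single `task.json` file with an `examples` list. The sketch below is a minimal illustration of loading and iterating one BBL task, assuming the schema used in the BIG-bench repository (each example carries an `input` plus a `target` or `target_scores` field); the local path is hypothetical.

```python
import json

def load_bigbench_task(path: str) -> dict:
    """Load one BIG-bench JSON task file.

    Assumes the repo's task.json schema: an 'examples' list whose items
    carry 'input' plus either 'target' or 'target_scores'."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def iter_examples(task: dict):
    """Yield (input, reference) pairs.

    For multiple-choice tasks the reference is the highest-scoring option
    in 'target_scores'; otherwise it is the 'target' string (or the first
    entry if 'target' is a list)."""
    for ex in task["examples"]:
        if "target_scores" in ex:
            reference = max(ex["target_scores"], key=ex["target_scores"].get)
        else:
            target = ex["target"]
            reference = target if isinstance(target, str) else target[0]
        yield ex["input"], reference

if __name__ == "__main__":
    # Hypothetical path into a local clone of the BIG-bench repository.
    task = load_bigbench_task("bigbench/benchmark_tasks/bbq_lite_json/task.json")
    for prompt, reference in list(iter_examples(task))[:3]:
        print(prompt, "->", reference)
```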
BIG-bench Hard (BBH) is a small subset of 23 challenging BIG-bench tasks. These are tasks for which prior language model evaluations did not outperform the average human-rater.
Since many BBH tasks require multi-step reasoning, few-shot prompting without chain-of-thought (CoT), as used in the original BIG-bench evaluations, substantially underestimates the best performance and capabilities of language models. With CoT prompting, PaLM surpasses average human-rater performance on 10 of the 23 tasks, and Codex on 17 of the 23.
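To make the difference concrete, the sketch below contrasts answer-only few-shot prompting with CoT prompting for a BBH-style task. The exemplar and the `generate` placeholder are hypothetical illustrations, not taken from the BBH paper; any completion endpoint can be plugged in.

```python
def generate(prompt: str) -> str:
    """Placeholder for a language model completion call."""
    raise NotImplementedError("plug in your LM completion call here")

FEW_SHOT = [
    # Hypothetical exemplar in the style of a multi-step BBH reasoning task.
    {
        "question": "I had 5 apples, bought 3 more, then ate 2. How many apples do I have?",
        "rationale": "Start with 5 apples; buying 3 gives 5 + 3 = 8; eating 2 leaves 8 - 2 = 6.",
        "answer": "6",
    },
]

def build_prompt(question: str, use_cot: bool) -> str:
    """Answer-only prompting shows only the final answer in each exemplar;
    CoT prompting inserts the worked reasoning before the answer."""
    blocks = []
    for ex in FEW_SHOT:
        if use_cot:
            answer = f"{ex['rationale']} So the answer is {ex['answer']}."
        else:
            answer = ex["answer"]
        blocks.append(f"Q: {ex['question']}\nA: {answer}")
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)

# Usage: compare the two prompting modes on the same question.
# direct_out = generate(build_prompt("A bus had 12 riders; 5 got off and 7 got on. How many now?", use_cot=False))
# cot_out    = generate(build_prompt("A bus had 12 riders; 5 got off and 7 got on. How many now?", use_cot=True))
```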