Evaluation leaderboards

🤗 Open LLM Leaderboard (by Hugging Face)

Track, rank and evaluate open LLMs and chatbots

🤗 LLM-Perf Leaderboard (by Hugging Face)

It aims to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) across different hardware, backends and optimizations, using Optimum-Benchmark and Optimum flavors.

EvalPlus Leaderboard

Evaluates AI Coders with rigorous tests.

LMSys Chatbot Arena (Elo Rating)

An open crowdsourced platform to collect human feedback and evaluate LLMs under real-world scenarios.
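Arena-style rankings like this one are typically computed from pairwise human votes using an Elo-style update. A minimal sketch of that update, assuming a standard Elo formula with an illustrative K-factor of 32 (not necessarily Chatbot Arena's actual parameters):

```python
# Minimal sketch of the Elo update behind pairwise arena-style rankings.
# K=32 is an illustrative K-factor, not Chatbot Arena's actual setting.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one battle.
    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two models start at 1000; A wins one human-voted battle.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)
print(round(a), round(b))  # 1016 984
```

Rating points are zero-sum per battle: what the winner gains, the loser loses.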

LLM Safety Leaderboard (by AI-Secure)

It aims to provide a unified evaluation for LLM safety and help researchers and practitioners better understand the capabilities, limitations, and potential risks of LLMs.

MTEB (Massive Text Embedding Benchmark) Leaderboard

MTEB is a multi-task and multi-language comparison of embedding models. It is relatively comprehensive, covering 8 core embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS), and summarization. It is open source.
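Several of these tasks (STS, retrieval, reranking) reduce to comparing embedding vectors, usually by cosine similarity. A toy sketch of that comparison with made-up 3-dimensional vectors (MTEB itself evaluates real embedding models on real datasets):

```python
import math

# Toy sketch of semantic textual similarity (STS): score sentence pairs by
# the cosine similarity of their embedding vectors. The 3-d vectors below
# are made up for illustration.

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

emb_query = [0.2, 0.9, 0.1]      # hypothetical embedding of sentence 1
emb_match = [0.25, 0.85, 0.05]   # hypothetical embedding of a paraphrase
emb_other = [0.9, 0.1, 0.4]      # hypothetical embedding of an unrelated sentence

# The paraphrase scores closer to the query than the unrelated sentence.
print(cosine_similarity(emb_query, emb_match) > cosine_similarity(emb_query, emb_other))  # True
```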

Evaluation & Observability libraries

🤗 LightEval (by Hugging-Face)

#evaluation #custom_evals
A lightweight LLM evaluation suite that Hugging Face has been using internally, alongside its recently released LLM data processing library datatrove and LLM training library nanotron. It is the technology behind the Hugging Face Open LLM Leaderboard.

Phoenix (by Arize)

#evaluation #custom_evals #observability
An open-source observability library and platform designed for experimentation, evaluation, and troubleshooting. The toolset is designed to ingest inference data for LLMs, CV, NLP, and tabular datasets, as well as LLM traces. It allows AI engineers and data scientists to quickly visualize their data, evaluate performance, surface issues and insights, and easily export data for improvement.

Open-source: True
License: Elastic License 2.0 (ELv2)

Has Trace & Span tracking: True
Has Trace & Span UI: True (Free in Notebook/Docker/Terminal - Paid in Cloud)
Has LLM Evaluation: True
Has Pre-defined Evaluations: True
Has Custom Evaluations: True

Note: benchmark both the model and the prompt template. Phoenix can display both the dataset embeddings and the customers' query embeddings as a reduced 2D representation, which makes it possible to spot bad responses, i.e., what is missing from the database to correctly answer the customers' questions.

Very good demo and explanations here:

Tested: False
Grade: ?/6

Here is an example from the Arize Phoenix debugging tool.


DeepEval

#evaluation #custom_evals #RAG
A simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs on metrics such as hallucination, answer relevancy, and RAGAS, using LLMs and various other NLP models that run locally on your machine.

Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.
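The Pytest-style workflow can be sketched in plain Python. This is a conceptual stand-in, not DeepEval's actual API: the keyword-recall metric below merely illustrates the idea of asserting that an LLM output clears a scored threshold.

```python
# Conceptual sketch of Pytest-style unit testing of LLM outputs, in the
# spirit of DeepEval. The metric is a toy stand-in (keyword recall), not
# DeepEval's actual LLM-based answer-relevancy metric.

def answer_relevancy(answer: str, expected_keywords: list[str]) -> float:
    """Fraction of expected keywords present in the answer (toy metric)."""
    hits = sum(1 for kw in expected_keywords if kw.lower() in answer.lower())
    return hits / len(expected_keywords)

def assert_llm_test(answer: str, keywords: list[str], threshold: float = 0.7):
    """Fail the test if the scored metric falls below the threshold."""
    score = answer_relevancy(answer, keywords)
    assert score >= threshold, f"relevancy {score:.2f} below threshold {threshold}"

# Would normally live in a test_* function collected by Pytest:
assert_llm_test(
    answer="Paris is the capital of France.",
    keywords=["Paris", "capital", "France"],
)
print("passed")  # passed
```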


Ragas

#evaluation #custom_evals #RAG
An evaluation framework for your Retrieval Augmented Generation (RAG) pipelines. Ragas provides tools based on the latest research for evaluating LLM-generated text, giving you insights about your RAG pipeline. Ragas can be integrated with your CI/CD to provide continuous checks and ensure performance.
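The core idea behind RAG metrics such as faithfulness is scoring the generated answer against the retrieved context. A deliberately naive token-overlap version of that idea (Ragas's real metrics use LLM judges, not word overlap):

```python
# Naive sketch of a RAG "faithfulness"-style check: how much of the
# generated answer is supported by the retrieved context. Real Ragas
# metrics use LLM judges; this token-overlap version only illustrates
# the concept of grounding answers in retrieved context.

def naive_faithfulness(answer: str, contexts: list[str]) -> float:
    context_tokens = set(" ".join(contexts).lower().split())
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return 0.0
    supported = sum(1 for t in answer_tokens if t in context_tokens)
    return supported / len(answer_tokens)

contexts = ["the eiffel tower is located in paris and was completed in 1889"]
grounded = naive_faithfulness("the eiffel tower is in paris", contexts)
ungrounded = naive_faithfulness("the eiffel tower is in berlin", contexts)
print(grounded > ungrounded)  # True
```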

Giskard (by Giskard-AI)

#evaluation #custom_evals #RAG
A testing framework dedicated to ML models, from tabular models to LLMs. It scans AI models to detect risks of bias, performance issues and errors, and provides Pytest integration.

Here is an example from the Giskard evaluation tools.

Validate (by Tonic)

#evaluation #custom_evals #RAG #observability
A framework that makes it easy to evaluate, track, and monitor your LLM and RAG applications. Validate lets you evaluate your LLM outputs using the provided metrics, which measure everything from answer correctness to LLM hallucination. Additionally, Validate has an optional UI to visualize your evaluation results for easy tracking and monitoring.

Language Model Evaluation Harness

#evaluation #custom_evals
This project provides a unified framework to test generative language models on a large number of different evaluation tasks. Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.

LangSmith (by LangChain)

#evaluation #custom_evals #observability
A platform for building production-grade LLM applications. It lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework and seamlessly integrates with LangChain, the go-to open source framework for building with LLMs.

LangChain benchmarks (based on LangSmith)


TruLens

#evaluation #custom_evals #observability
Evaluate, iterate faster, and select your best LLM app. TruLens is a software tool that helps you objectively measure the quality and effectiveness of your LLM-based applications using feedback functions. Feedback functions programmatically evaluate the quality of inputs, outputs, and intermediate results, so that you can expedite and scale up experiment evaluation. Use it for a wide variety of use cases, including question answering, summarization, retrieval-augmented generation, and agent-based applications.
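A feedback function is, conceptually, just a callable that maps some part of an app run (input, output, or intermediate result) to a 0-1 score. A minimal sketch of that idea, with illustrative names rather than TruLens's actual API:

```python
from dataclasses import dataclass

# Minimal sketch of "feedback functions": callables that score parts of an
# LLM app run on a 0-1 scale. Names are illustrative, not TruLens's API.

@dataclass
class Record:
    prompt: str
    response: str
    retrieved_context: str

def conciseness(record: Record) -> float:
    """Toy feedback: full marks up to 50 words, then a linear penalty."""
    words = len(record.response.split())
    return 1.0 if words <= 50 else max(0.0, 1.0 - (words - 50) / 100)

def context_usage(record: Record) -> float:
    """Toy feedback: did the response reuse any word from the retrieved context?"""
    ctx = set(record.retrieved_context.lower().split())
    return 1.0 if any(w in ctx for w in record.response.lower().split()) else 0.0

record = Record(
    prompt="Summarize the doc.",
    response="The document describes quarterly revenue growth.",
    retrieved_context="quarterly revenue grew 12 percent year over year",
)
scores = {fn.__name__: fn(record) for fn in (conciseness, context_usage)}
print(scores)  # {'conciseness': 1.0, 'context_usage': 1.0}
```

Running many such functions over every record is what makes experiment evaluation scalable.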


#evaluation #custom_evals
An open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

OpenAI Evals

#evaluation #custom_evals
Provides a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for the use cases you care about. You can also use your data to build private evals which represent common LLM patterns in your workflow without exposing any of that data publicly.
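At its core, a custom eval is a loop: run each sample's input through a completion function and grade the result against an ideal answer. A minimal sketch of that loop with exact-match grading; the hard-coded completion function is a stub standing in for a real model call, not OpenAI Evals' actual API.

```python
# Minimal sketch of a custom eval loop in the spirit of OpenAI Evals:
# run samples through a completion function and grade by exact match.

def completion_fn(prompt: str) -> str:
    """Stub model: replace with a real API call in practice."""
    canned = {"2+2=": "4", "Capital of France?": "Paris"}
    return canned.get(prompt, "I don't know")

samples = [
    {"input": "2+2=", "ideal": "4"},
    {"input": "Capital of France?", "ideal": "Paris"},
    {"input": "Capital of Atlantis?", "ideal": "None"},
]

def run_eval(samples, completion_fn):
    """Return the fraction of samples whose completion matches the ideal."""
    results = [completion_fn(s["input"]) == s["ideal"] for s in samples]
    return sum(results) / len(results)

print(run_eval(samples, completion_fn))  # 0.6666666666666666
```

Real eval registries swap in richer graders (fuzzy match, model-graded rubrics) for the exact-match comparison.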


An open source LLM engineering platform to help teams collaboratively debug, analyze and iterate on their LLM Applications.

LlamaIndex Evaluation

LlamaIndex is meant to connect your data to your LLM applications. Sometimes, even after diagnosing and fixing bugs by looking at traces, more fine-grained evaluation is required to systematically diagnose issues. LlamaIndex aims to provide those tools to make identifying issues and receiving useful diagnostic signals easy.


MixEval

A ground-truth-based dynamic benchmark derived from off-the-shelf benchmark mixtures. It evaluates LLMs with highly accurate model ranking (0.96 correlation with Chatbot Arena) while running locally and quickly (6% of the time and cost of running MMLU), with its queries stably and effortlessly updated every month to avoid contamination. It consists of two benchmarks, MixEval and MixEval-Hard, both updated periodically with a fast, stable pipeline.

Prompt-flow Tracing (by Azure)

A trace records specific events or the state of an application during execution. It can include data about function calls, variable values, system events, and more. Traces help break down an application's components into discrete inputs and outputs, which is crucial for debugging and understanding an application.
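The mechanics can be sketched in a few lines: each span records a name, timing, attributes, and nesting depth, which is the raw material a tracing UI visualizes as a waterfall. This is an illustrative stdlib-only sketch, not Prompt flow's actual tracing API.

```python
import time
from contextlib import contextmanager

# Minimal sketch of trace/span capture. Each span records a name, duration,
# attributes, and nesting depth. Illustrative only; real tracing APIs
# (Prompt flow, OpenTelemetry) are richer.

SPANS = []
_depth = 0

@contextmanager
def span(name: str, **attributes):
    global _depth
    start = time.perf_counter()
    _depth += 1
    try:
        yield
    finally:
        _depth -= 1
        SPANS.append({
            "name": name,
            "depth": _depth,
            "duration_s": time.perf_counter() - start,
            "attributes": attributes,
        })

with span("handle_request", user="u-123"):
    with span("retrieve", query="weather"):
        pass  # pretend to query a vector store
    with span("llm_call", model="stub"):
        pass  # pretend to call the model

# Children close (and are appended) before their parent does.
print([(s["name"], s["depth"]) for s in SPANS])
# [('retrieve', 1), ('llm_call', 1), ('handle_request', 0)]
```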

Evaluation benchmarks

FLASK benchmark

FLASK defines four primary abilities, divided into 12 fine-grained skills, to comprehensively evaluate the performance of language models.

HELM benchmark

HELM adopts a top-down approach, explicitly specifying the scenarios and metrics to be evaluated and working through the underlying structure.

Evaluation models


HHEM (by Vectara)

The HHEM model is an open-source model for detecting hallucinations in LLM outputs. It is particularly useful when building retrieval-augmented generation (RAG) applications, where a set of facts is summarized by an LLM, but it can also be used in other contexts.

Evaluation datasets or benchmarks

SQuAD (Stanford Question Answering Dataset)

Used for Question & Answering Evaluation


Used for Retrieval Evaluation, Q&A Evaluation

MS Marco

Used for Retrieval Evaluation


Used for Toxicity Evaluation


Used for Summarization Evaluation


Used for Summarization Evaluation


Used for Summarization Evaluation


Used for Code Generation Evaluation


Used for Code Generation Evaluation


Used for Code Generation Evaluation


Used for Math / Problem solving


Used for Math / Problem solving


Used for Math / Problem solving


Used for Math / Problem solving


Used for Common Sense


Used for Common Sense


Used for Hallucination Evaluation

MMLU benchmark

Used for Multitasking Evaluation


Used for Named Entity Recognition (NER) Evaluation

Penn Treebank dataset

Used for Part Of Speech (POS) tagging Evaluation

Stanford Sentiment Treebank (SST)

Used for Sentiment Analysis Evaluation and Parsing Evaluation


Used for Summarization Evaluation

Questions generation

LlamaIndex Question Generation

Phoenix Question Generation