Evaluation tools
Evaluation leaderboards
🤗 Open LLM Leaderboard (by Hugging Face)
Track, rank and evaluate open LLMs and chatbots
🤗 LLM-Perf Leaderboard (by Hugging Face)
It aims to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) across different hardware, backends and optimizations, using Optimum-Benchmark and Optimum flavors.
- https://huggingface.co/spaces/optimum/llm-perf-leaderboard
- https://github.com/huggingface/optimum-benchmark
- https://github.com/huggingface/optimum
EvalPlus Leaderboard
Evaluates AI Coders with rigorous tests.
LMSys Chatbot Arena (Elo Rating)
An open crowdsourced platform to collect human feedback and evaluate LLMs under real-world scenarios.
- https://arxiv.org/abs/2306.05685
- https://chat.lmsys.org/
- https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
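Arena-style rankings like this one are computed from pairwise human votes. A minimal sketch of the classic Elo update that gave the leaderboard its name (illustrative only; the live leaderboard has since moved to statistical models such as Bradley-Terry):

```python
def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update two Elo ratings after one pairwise comparison.

    score_a is 1.0 if model A wins the vote, 0.0 if it loses, 0.5 for a tie.
    """
    # Expected win probability of A given the current rating gap.
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

# Two models start at 1000; model A wins one head-to-head vote.
a, b = elo_update(1000.0, 1000.0, score_a=1.0)  # -> 1016.0, 984.0
```

With enough crowdsourced votes, repeatedly applying this update over all battles converges to a ranking of the models.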
LLM Safety Leaderboard (by AI-Secure)
It aims to provide a unified evaluation for LLM safety and help researchers and practitioners better understand the capabilities, limitations, and potential risks of LLMs.
MTEB (Massive Text Embedding Benchmark) Leaderboard
MTEB is a multi-task and multi-language comparison of embedding models. It's relatively comprehensive, covering 8 core embedding tasks: bitext mining, classification, clustering, pair classification, reranking, retrieval, semantic textual similarity (STS) and summarization. It is open source.
- Easy to plug in new models through a very simple API
- Easy to plug in new datasets for existing tasks
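The "very simple API" amounts to providing an object with an `encode` method that maps a list of sentences to a list of embedding vectors; MTEB then runs it over the selected tasks. A sketch of such a pluggable model, where the hashed bag-of-words embedding is a toy stand-in (real models, e.g. via sentence-transformers, return dense learned vectors):

```python
import hashlib

class ToyEmbedder:
    """Toy model exposing the encode() interface MTEB expects."""

    def __init__(self, dim: int = 16):
        self.dim = dim

    def encode(self, sentences, **kwargs):
        """Map a list of sentences to a list of fixed-size vectors."""
        vectors = []
        for sentence in sentences:
            vec = [0.0] * self.dim
            for token in sentence.lower().split():
                # Hash each token into a bucket of the vector (toy embedding).
                h = int(hashlib.md5(token.encode()).hexdigest(), 16)
                vec[h % self.dim] += 1.0
            vectors.append(vec)
        return vectors

model = ToyEmbedder()
embeddings = model.encode(["hello world", "evaluate embeddings"])

# With the mteb package installed, a model like this is plugged in via
# something along the lines of (check the current mteb docs for exact names):
#   evaluation = mteb.MTEB(tasks=["Banking77Classification"])
#   evaluation.run(model)
```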
Evaluation & Observability libraries
🤗 LightEval (by Hugging Face)
#evaluation #custom_evals
A lightweight LLM evaluation suite that Hugging Face has been using internally, alongside the recently released LLM data processing library datatrove and LLM training library nanotron. It is the tech behind the HF leaderboard.
- https://huggingface.co/lighteval
- https://github.com/huggingface/lighteval
- https://www.linkedin.com/posts/philipp-schmid-a6a2bb196_evaluate-llms-with-hugging-face-lighteval-activity-7170795086224543745-R8U2?utm_source=share&utm_medium=member_android
Phoenix (by Arize)
#evaluation #custom_evals #observability
An open-source observability library and platform designed for experimentation, evaluation, and troubleshooting. The toolset is designed to ingest inference data for LLMs, CV, NLP, and tabular datasets as well as LLM traces. It allows AI Engineers and Data Scientists to quickly visualize their data, evaluate performance, track down issues & insights, and easily export to improve.
Open-source: True
License: Elastic License 2.0 (ELv2)
Has Trace & Span tracking: True
Has Trace & Span UI: True (Free in Notebook/Docker/Terminal - Paid in Cloud)
Has LLM Evaluation: True
Has Pre-defined Evaluations: True
Has Custom Evaluations: True
Note: Benchmark both the model and the prompt template. We can see both the dataset embeddings AND the customers' query embeddings as a reduced 2D representation, and hence spot BAD RESPONSES (i.e., what is missing in the database to correctly answer the customers' questions).
Very good demo and explanations here:
- https://www.youtube.com/watch?v=hbQYDpJayFw
- http://bit.ly/llama-index-phoenix-tutorial
- https://phoenix-demo.arize.com/projects
Tested: False
Grade: ?/6
Here is an example from the Arize Phoenix debugging tool.
DeepEval
#evaluation #custom_evals #RAG
A simple-to-use, open-source LLM evaluation framework. It is similar to Pytest but specialized for unit testing LLM outputs. DeepEval incorporates the latest research to evaluate LLM outputs based on metrics such as hallucination, answer relevancy, RAGAS, etc., using LLMs and various other NLP models that run locally on your machine for evaluation.
Whether your application is implemented via RAG or fine-tuning, LangChain or LlamaIndex, DeepEval has you covered. With it, you can easily determine the optimal hyperparameters to improve your RAG pipeline, prevent prompt drifting, or even transition from OpenAI to hosting your own Llama2 with confidence.
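The Pytest-style workflow boils down to: build a test case from the input, the actual output and (optionally) the retrieval context, score it with one or more metrics, and assert the score clears a threshold. A self-contained sketch of that pattern with a toy keyword-overlap metric (the class and function names mimic DeepEval's style but are reimplemented here; DeepEval's real metrics are LLM-judged rather than word-overlap based):

```python
from dataclasses import dataclass, field

@dataclass
class LLMTestCase:
    input: str
    actual_output: str
    retrieval_context: list = field(default_factory=list)

def toy_relevancy(case: LLMTestCase) -> float:
    """Toy metric: fraction of query words echoed in the answer."""
    query_words = set(case.input.lower().split())
    answer_words = set(case.actual_output.lower().split())
    return len(query_words & answer_words) / max(len(query_words), 1)

def assert_test(case: LLMTestCase, threshold: float = 0.5):
    """Fail the unit test if the metric score is below the threshold."""
    score = toy_relevancy(case)
    assert score >= threshold, f"relevancy {score:.2f} below {threshold}"

# Collected and run by plain pytest like any other unit test.
def test_shoe_policy():
    case = LLMTestCase(
        input="what if these shoes don't fit",
        actual_output="These shoes can be returned within 30 days if they don't fit.",
        retrieval_context=["All shoes can be refunded within 30 days."],
    )
    assert_test(case, threshold=0.3)
```

Because each check is an ordinary test function, the same suite can be rerun across hyperparameter choices (chunk size, retriever, prompt template) to compare RAG pipeline variants.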
Ragas
#evaluation #custom_evals #RAG
Evaluation framework for your Retrieval Augmented Generation (RAG) pipelines. Ragas provides you with the tools based on the latest research for evaluating LLM-generated text to give you insights about your RAG pipeline. Ragas can be integrated with your CI/CD to provide continuous checks to ensure performance.
Giskard (by Giskard-AI)
#evaluation #custom_evals #RAG
A testing framework dedicated to ML models, from tabular models to LLMs. It scans AI models to detect risks of bias, performance issues and errors, and provides Pytest integration.
Here is an example from the Giskard evaluation tools.
Validate (by Tonic)
#evaluation #custom_evals #RAG #observability
A framework to make it easy to evaluate, track, and monitor your LLM and RAG applications. Validate lets you evaluate your LLM outputs using the provided metrics, which measure everything from answer correctness to LLM hallucination. Additionally, Validate has an optional UI to visualize your evaluation results for easy tracking and monitoring.
- https://github.com/TonicAI/tonic_validate
- https://docs.llamaindex.ai/en/stable/community/integrations/tonicvalidate.html
Language Model Evaluation Harness
#evaluation #custom_evals
This project provides a unified framework to test generative language models on a large number of different evaluation tasks. Over 60 standard academic benchmarks for LLMs, with hundreds of subtasks and variants implemented.
LangSmith (by LangChain)
#evaluation #custom_evals #observability
A platform for building production-grade LLM applications. It lets you debug, test, evaluate, and monitor chains and intelligent agents built on any LLM framework and seamlessly integrates with LangChain, the go-to open source framework for building with LLMs.
- https://docs.smith.langchain.com/
- https://www.langchain.com/langsmith
- https://python.langchain.com/docs/guides/evaluation/
LangChain benchmarks (based on LangSmith)
TruLens
#evaluation #custom_evals #observability
Evaluate, iterate faster, and select your best LLM app. TruLens is a software tool that helps you to objectively measure the quality and effectiveness of your LLM-based applications using feedback functions. Feedback functions help to programmatically evaluate the quality of inputs, outputs, and intermediate results, so that you can expedite and scale up experiment evaluation. Use it for a wide variety of use cases including question answering, summarization, retrieval-augmented generation, and agent-based applications.
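Conceptually, a feedback function is just a callable that maps some slice of a recorded app call (input, output, or an intermediate result such as retrieved context) to a score in [0, 1], which can then be aggregated across records. A self-contained sketch of that idea with toy scorers (not TruLens's actual API; its real feedback functions typically wrap an LLM or NLI provider):

```python
from statistics import mean

def conciseness(record_output: str) -> float:
    """Toy feedback on the output: score in [0, 1], higher for shorter answers."""
    words = len(record_output.split())
    return max(0.0, 1.0 - words / 100.0)

def context_overlap(record_input: str, context: str) -> float:
    """Toy feedback on an intermediate result: does the retrieved
    context share vocabulary with the question?"""
    q = set(record_input.lower().split())
    c = set(context.lower().split())
    return len(q & c) / max(len(q), 1)

# Aggregate feedback across recorded app calls, leaderboard-style.
records = [
    {"input": "define elo rating", "context": "elo rating explained",
     "output": "A pairwise skill rating system."},
    {"input": "summarize the report", "context": "quarterly report text",
     "output": "Revenue grew."},
]
scores = {
    "conciseness": mean(conciseness(r["output"]) for r in records),
    "context_overlap": mean(context_overlap(r["input"], r["context"]) for r in records),
}
```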
UpTrain
#evaluation #custom_evals
An open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.
OpenAI Evals
#evaluation #custom_evals
Provides a framework for evaluating large language models (LLMs) or systems built using LLMs. It offers an existing registry of evals to test different dimensions of OpenAI models, plus the ability to write your own custom evals for the use cases you care about. You can also use your data to build private evals which represent the common LLM patterns in your workflow without exposing any of that data publicly.
- https://github.com/openai/evals
- https://arize.com/blog-course/evals-openai-simplifying-llm-evaluation/
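Custom evals in the registry are driven by JSONL sample files, where each line pairs chat-format input messages with an ideal answer. A sketch of building one such line (field names as used by the basic match/includes eval templates in the evals repo; other templates use different schemas, so check the repo docs):

```python
import json

# One eval sample: chat-style "input" messages plus the "ideal" reference answer.
sample = {
    "input": [
        {"role": "system", "content": "Answer with a single word."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    "ideal": "Paris",
}
line = json.dumps(sample)  # one line of the samples.jsonl file
```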
Langfuse
#observability
An open source LLM engineering platform to help teams collaboratively debug, analyze and iterate on their LLM Applications.
LlamaIndex Evaluation
#evaluation
LlamaIndex is meant to connect your data to your LLM applications. Sometimes, even after diagnosing and fixing bugs by looking at traces, more fine-grained evaluation is required to systematically diagnose issues. LlamaIndex aims to provide those tools to make identifying issues and receiving useful diagnostic signals easy.
Evaluation benchmarks
FLASK benchmark
Defines four primary abilities, which are divided into 12 fine-grained skills, to evaluate the performance of language models comprehensively.
HELM benchmark
Adopts a top-down approach, explicitly specifying the scenarios and metrics to be evaluated and working through the underlying structure.
Evaluation models
vectara/hallucination_evaluation_model
The HHEM model is an open-source model for detecting hallucinations in LLMs. It is particularly useful when building retrieval-augmented generation (RAG) applications, where a set of facts is summarized by an LLM, but the model can also be used in other contexts.
Evaluation datasets or benchmarks
SQuAD (Stanford Question Answering Dataset)
Used for Question Answering Evaluation
WikiQA
Used for Retrieval Evaluation, Q&A Evaluation
- https://paperswithcode.com/dataset/wikiqa
- https://huggingface.co/datasets/wiki_qa
- https://www.microsoft.com/en-us/download/details.aspx?id=52419
MS Marco
Used for Retrieval Evaluation
- https://microsoft.github.io/msmarco/
- https://huggingface.co/datasets/ms_marco
- https://paperswithcode.com/dataset/ms-marco
WikiToxic
Used for Toxicity Evaluation
GigaWord
Used for Summarization Evaluation
CNNDM
Used for Summarization Evaluation
Xsum
Used for Summarization Evaluation
- https://huggingface.co/datasets/EdinburghNLP/xsum
- https://paperswithcode.com/sota/text-summarization-on-x-sum
WikiSQL
Used for Code Generation Evaluation
HumanEval
Used for Code Generation Evaluation
- https://github.com/openai/human-eval
- https://huggingface.co/datasets/openai_humaneval
- https://paperswithcode.com/sota/code-generation-on-humaneval
CodeXGlue
Used for Code Generation Evaluation
- https://microsoft.github.io/CodeXGLUE/
- https://github.com/microsoft/CodeXGLUE
- https://huggingface.co/datasets/code_x_glue_cc_code_to_code_trans
GSM8K
Used for Math / Problem solving
SVAMP
Used for Math / Problem solving
AQUA
Used for Math / Problem solving
MultiArith
Used for Math / Problem solving
CS-QA (CommonsenseQA)
Used for Common Sense Evaluation
StrategyQA
Used for Common Sense Evaluation
RAGTruth
Used for Hallucination Evaluation
MMLU benchmark
Used for Multitask Language Understanding Evaluation
CoNLL-2003
Used for Named Entity Recognition (NER) Evaluation
Penn Treebank dataset
Used for Part Of Speech (POS) tagging Evaluation
Stanford Sentiment Treebank (SST)
Used for Sentiment Analysis Evaluation and Parsing Evaluation
DUC and TAC
Used for Summarization Evaluation
Questions generation
LlamaIndex Question Generation
Phoenix Question Generation
- https://docs.arize.com/phoenix/api/evals#phoenix.evals.llm_generate
- https://colab.research.google.com/github/Arize-ai/phoenix/blob/main/tutorials/evals/evaluate_rag.ipynb#scrollTo=vSE9yY4k-YeN (great generation template in the Evaluation section)