Observability - Evaluations

References and useful resources

LLMs' problems

Possible causes

What do we need to evaluate?

Evaluation must cover the different aspects of how the LLM is used: by task and by semantic category, and from NLP metrics to production metrics to end-to-end system metrics.

  1. Embedding Model : is the embedding correctly separating the concepts of the use-case?
  2. LLM Model : is the model performing well generally?
  3. LLM System - RAG Retrieval : is the retrieved data appropriate? are the provided links/sources OK?
  4. LLM System - RAG Generation : is the generated answer using the provided context?
  5. LLM End-to-End : overall are the selected Model and System (Q&A, Summarization, Code, etc.) answering the user questions?

Embedding Model Evaluations

Embeddings can be evaluated with standard information retrieval metrics such as Recall@K, Precision@K, and NDCG@K, which are easy to implement.
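
As a rough illustration of how simple these metrics are, here is a minimal sketch in plain Python. It assumes binary relevance (a document is either relevant or not); the function names and the toy document ids are made up for the example.

```python
import math


def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def ndcg_at_k(retrieved, relevant, k):
    """NDCG with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc in enumerate(retrieved[:k])
        if doc in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: ids returned by the retriever, and the ids judged relevant.
retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, 3))
print(recall_at_k(retrieved, relevant, 3))
print(ndcg_at_k(retrieved, relevant, 3))
```

With graded relevance judgments, the gain term in `ndcg_at_k` would have to change; this sketch only covers the binary case.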

Another critical part of embedding evaluation is operational performance: cost, latency, and throughput. In many cases, embeddings must be generated for every item on every update, or for every query under latency limits.
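
One possible way to measure latency and throughput is to time the embedding call over batches, as in the sketch below. `fake_embed` is a hypothetical stand-in for the real model or API call, and the batch size is an arbitrary choice; cost is not covered here because it depends on the provider's pricing.

```python
import math
import statistics
import time


def measure_embedding_performance(embed_fn, texts, batch_size=8):
    """Time embed_fn over batches of texts and summarize latency and throughput."""
    if not texts:
        raise ValueError("texts must not be empty")
    latencies = []
    start = time.perf_counter()
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        t0 = time.perf_counter()
        embed_fn(batch)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start
    latencies.sort()
    # Nearest-rank 95th percentile over the per-batch latencies.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "batches": len(latencies),
        "mean_latency_s": statistics.mean(latencies),
        "p95_latency_s": latencies[p95_index],
        "throughput_items_per_s": len(texts) / total if total > 0 else float("inf"),
    }


def fake_embed(batch):
    # Hypothetical stand-in for a real embedding model or API call.
    return [[float(len(text))] for text in batch]


stats = measure_embedding_performance(fake_embed, [f"document {i}" for i in range(20)])
print(stats)
```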

There are also more comprehensive evaluation tools such as MTEB (Massive Text Embedding Benchmark), which is a good example of a more complete set of embedding evaluation tasks. Its leaderboard also shows that there is no clear winner across all tasks, so we need to select or train a model depending on the current use case.

LLM Model Evaluations

In this case we are evaluating the overall performance of the foundation models. In other words, we compare several foundation models (or versions), just as in regular Machine Learning evaluations: the very same dataset(s) are used on each model to assess a given list of tasks.


Typically, early-stage / pre-training models are evaluated on standard NLP benchmarks and metrics such as SQuAD 2.0, SNLI, GLUE, BLEU score, ROUGE score, etc., which are good at measuring standard NLP tasks.
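
To give an idea of what an overlap metric such as ROUGE computes, here is a simplified unigram-overlap F1 in the spirit of ROUGE-1. It assumes lowercase whitespace tokenization and no stemming, so its values will not necessarily match those of a maintained ROUGE implementation; the function name is made up for the example.

```python
from collections import Counter


def unigram_overlap_f1(candidate, reference):
    """ROUGE-1-style F1: clipped unigram overlap between candidate and reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


print(unigram_overlap_f1("the cat sat on the mat", "the cat is on the mat"))
```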

LLM System Evaluations

In this case, we are evaluating how well the inputs determine the outputs. We observe the various components of the LLM system that we control (including fine-tuning or adapter-tuning) to see why we end up with a good or a wrong answer in the output. The very same dataset is used on each LLM system to assess a given list of tasks and compare them.
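
The idea of running the same dataset through every system can be sketched as a small harness. Everything here is illustrative: the two "systems" are trivial stand-ins for real pipelines, the two examples are made up, and exact match is only one possible scoring function.

```python
def exact_match(answer, expected):
    """1.0 when the normalized answer equals the expected one, else 0.0."""
    return 1.0 if answer.strip().lower() == expected.strip().lower() else 0.0


def evaluate_systems(systems, dataset, score_fn):
    """Run every system on the same examples and return its mean score."""
    results = {}
    for name, system in systems.items():
        scores = [score_fn(system(ex["question"]), ex["expected"]) for ex in dataset]
        results[name] = sum(scores) / len(scores) if scores else 0.0
    return results


# Made-up evaluation examples.
dataset = [
    {"question": "What is 2 + 2?", "expected": "4"},
    {"question": "What is the capital of France?", "expected": "Paris"},
]

# Hypothetical stand-ins for two differently configured LLM systems.
lookup = {ex["question"]: ex["expected"] for ex in dataset}
systems = {
    "system_a": lambda question: "4",
    "system_b": lambda question: lookup.get(question, ""),
}

print(evaluate_systems(systems, dataset, exact_match))
```

Because both systems see exactly the same examples and the same scoring function, any score difference can be attributed to the systems themselves.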


Evaluation Metrics

  1. LLM assisted evaluation
    • Different templates and different models for different tasks (Relevance, Hallucinations, Toxicity, Honesty, Question answering eval, Summarization eval, Translation eval, Code generation eval, Ref Link eval ...)
  2. User-provided feedback
    • Thumbs up / thumbs down
    • Accept or reject response
    • ...
  3. Task based metrics (and benchmarks)
  4. Multi-metric evaluation
    • FLASK benchmark - defines four primary abilities which are divided into 12 fine-grained skills to evaluate the performance of language models comprehensively.
    • HELM benchmark - adopts a top-down approach explicitly specifying the scenarios and metrics to be evaluated and working through the underlying structure.
    • LM Evaluation Harness benchmark - a unified framework used by the Hugging Face LLM Leaderboard, to test generative language models on a large number of different evaluation tasks.
    • ...
  5. RAG / Ranking / Recommendation metrics
    • Precision@K - measures the percentage of relevant documents amongst the top K retrieved documents (without taking into account the position of the item in the list),
    • NDCG - compares rankings to an ideal order where all relevant items are at the top,
    • Hit rate - measures the share of users that get at least one relevant recommendation (it can be a binary True/False if considering one query, or a ratio if considering K queries),
    • MRR - helps understand the average position of the first relevant item across all user lists,
    • ...
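
As an illustration of the ranking metrics listed above, Hit rate and MRR can be sketched in a few lines of plain Python. The sketch assumes each query comes with a ranked list of retrieved ids and a set of relevant ids; the names and the toy runs are made up.

```python
def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs) / len(runs)


def hit_rate_at_k(runs, k):
    """Share of queries with at least one relevant item in the top-k."""
    hits = sum(
        1 for retrieved, relevant in runs
        if any(doc in relevant for doc in retrieved[:k])
    )
    return hits / len(runs)


# Toy runs: one (ranked results, relevant ids) pair per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),  # first relevant item at rank 2
    (["d5", "d6", "d2"], {"d2"}),  # first relevant item at rank 3
    (["d9", "d8", "d4"], {"d0"}),  # no relevant item retrieved
]
print(mean_reciprocal_rank(runs))
print(hit_rate_at_k(runs, 2))
```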

⚠️ Recommendation: use Confusion Matrices with categorical variables rather than just scores, because scores are much harder to interpret and therefore make it harder to dig into the problem.
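
A confusion matrix over categorical labels can be built with a counter, as in this minimal sketch. It assumes we have human labels and the labels produced by an LLM evaluator for the same items; the label names and the data are made up.

```python
from collections import Counter


def confusion_matrix(expected, predicted, labels):
    """Rows are expected labels, columns are predicted labels, in the order of labels."""
    counts = Counter(zip(expected, predicted))
    return [[counts[(e, p)] for p in labels] for e in labels]


labels = ["relevant", "irrelevant"]
human = ["relevant", "relevant", "irrelevant", "irrelevant", "relevant"]
llm_eval = ["relevant", "irrelevant", "irrelevant", "relevant", "relevant"]

matrix = confusion_matrix(human, llm_eval, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>10}: {row}")
```

Each off-diagonal cell points to a specific kind of disagreement (for example, relevant items judged irrelevant), which is what makes the matrix easier to act on than a single score.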

Evaluation datasets

There are many evaluation datasets (see _Tools) ready to assess Common Sense, Math & Problem Solving, Q&A, Summarization, Code generation, etc., and they are always a good starting point or a good complement.

But nothing beats a Golden Dataset built with the project's data and specifically tailored for a given use case.

Ideally, this dataset should be created/curated by humans with enough expertise in the domain covered by the data, but in practice, it's hard to gather such a team at a reasonable price. An alternative is to rely on very good LLMs (such as GPT-4 or Claude-3 at the time of writing) to help generate interesting and useful examples.

Synthetic Data Generation for Retrieval Evaluations

  1. Parse / chunk-up text corpus.
  2. Prompt an LLM (GPT-4) to generate questions from each chunk (or for a subset of chunks).
  3. If the dataset aims to be a Golden Dataset, carefully review the pairs.
  4. Each pair (question, chunk) can be used for Evaluation or Fine-tuning.
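
The steps above can be sketched as follows. This is only an outline under stated assumptions: the chunker is a deliberately naive character window, and `generate_questions` is a hypothetical callable that would wrap the real LLM call (a stub is used here so the example runs offline).

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping character windows (a naive chunker)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():
            chunks.append(chunk)
    return chunks


def build_retrieval_dataset(corpus, generate_questions, questions_per_chunk=2):
    """Create (question, chunk) pairs; 'reviewed' stays False until a human checks the pair."""
    dataset = []
    for doc_id, text in corpus.items():
        for index, chunk in enumerate(chunk_text(text)):
            for question in generate_questions(chunk, questions_per_chunk):
                dataset.append({
                    "question": question,
                    "chunk_id": f"{doc_id}:{index}",
                    "chunk": chunk,
                    "reviewed": False,
                })
    return dataset


def stub_generate_questions(chunk, n):
    # Hypothetical stand-in for prompting an LLM with the chunk.
    return [f"Question {i + 1} about: {chunk[:30]}" for i in range(n)]


corpus = {"doc1": "Evaluation of retrieval systems requires labeled pairs. " * 6}
pairs = build_retrieval_dataset(corpus, stub_generate_questions)
print(len(pairs), pairs[0]["chunk_id"])
```

The `reviewed` flag is one simple way to track step 3: only pairs that a human has checked would be promoted to the Golden Dataset.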


Synthetic Data Generation for End-to-End Evaluations

  1. Parse / chunk-up text corpus.
  2. Prompt an LLM (GPT-4) to generate questions from each chunk (or for a subset of chunks).
  3. Run (question, context) pairs through an LLM (GPT-4) to get a ground-truth response.
  4. If the dataset aims to be a Golden Dataset, carefully review the pairs.
  5. Each pair (question, ground-truth-response) can be used for Evaluation or Fine-tuning.
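
A possible shape for this pipeline is sketched below. The prompt templates are illustrative wording, not templates from any library, and `llm` is a hypothetical callable taking a prompt string and returning the model's text (a stub replaces the real model so the example runs offline).

```python
QUESTION_PROMPT = (
    "Write one question that can be answered using only this passage:\n\n{chunk}"
)
ANSWER_PROMPT = (
    "Answer the question using only the context.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}"
)


def build_e2e_dataset(chunks, llm):
    """Generate a question per chunk, then a ground-truth answer from (question, context)."""
    dataset = []
    for chunk in chunks:
        question = llm(QUESTION_PROMPT.format(chunk=chunk)).strip()
        answer = llm(ANSWER_PROMPT.format(context=chunk, question=question)).strip()
        dataset.append({
            "question": question,
            "context": chunk,
            "ground_truth": answer,
            "reviewed": False,  # flip to True after human review
        })
    return dataset


def stub_llm(prompt):
    # Hypothetical stand-in for a real LLM call; echoes the last line of the prompt.
    return "stub response for: " + prompt.splitlines()[-1]


chunks = ["Precision@K ignores item positions.", "MRR looks at the first relevant item."]
examples = build_e2e_dataset(chunks, stub_llm)
print(examples[0]["question"])
```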


Good practices

  1. Benchmark using a Golden Dataset.
  2. Use cross-validation on the Golden Dataset to avoid overfitting (hold-out, k-fold ...).
  3. Evaluate the LLM based on the appropriate task(s) (not a generic metric unrelated to the use-case).
  4. Carefully define the LLM Evaluation Templates (or use libraries with built-in prompt templates).
  5. Run End-to-end evaluation first, and if there is a problem run other evals.
  6. Queries and documents change over time, so the evaluations must reflect these changes to guard against Concept drift or Data drift.
  7. Select the foundation model using relevant metrics for the use case.

    In the above table, we can see that even if GPT-3.5-turbo-instruct is not a good model for Q&A, it might be a good (and cheap) option for Pure completion (such as playing a chess game, generating code, generating poems, etc.).
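
For practice 2 above (cross-validation on the Golden Dataset), a k-fold split can be sketched with the standard library alone. The function name, the default `k`, and the fixed seed are arbitrary choices for the example; the "tune" portion is whatever is used to adjust prompts or parameters, and the held-out fold is used only for scoring.

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Yield (tune, held_out) pairs so that each example is held out exactly once."""
    if not 2 <= k <= len(examples):
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    # Deal shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for held_out_index, held_out_fold in enumerate(folds):
        tune = [
            examples[j]
            for fold_index, fold in enumerate(folds)
            if fold_index != held_out_index
            for j in fold
        ]
        held_out = [examples[j] for j in held_out_fold]
        yield tune, held_out


golden = [f"example-{i}" for i in range(10)]
for tune, held_out in k_fold_splits(golden, k=5):
    print(len(tune), len(held_out))
```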

How to fix Hallucinations / Responses quality

  1. Improve retrieval / ranking of relevant documents
  2. Improve chunking strategy & document quality
  3. Improve prompt (prompt engineering)


Notebooks from Arize Phoenix

Lab 1

Evaluating Hallucinations

Lab 2

Evaluating Toxicity

Lab 3

Evaluating Relevance of Retrieved Documents

Lab 4

Evaluating RAG with Phoenix's LLM Evals

Lab 5

Evaluating RAG with Giskard-AI

Lab 6

Evaluating Question-Answering

Lab 7

Evaluating Summarization

Lab 8

Evaluating Code Readability

Lab 9

Evaluating and Improving a LlamaIndex Search and Retrieval Application

Metrics Ensembling

⚠️ TODO: write this section

LLM Model Evaluation libraries, metrics & leaderboards




Basic steps to improve the answers

  1. Zero-shot prompts
  2. Few-shot prompts
  3. Retrieval-Augmented few-shot prompts
  4. Fine-tuning
  5. Custom model