Last Updated on 2024-08-10 by Clay
Introduction
The year 2023 witnessed an explosion of generative AI technologies, with a myriad of applications emerging across various domains. In the field of Natural Language Processing (NLP), Large Language Models (LLMs) stand out as one of the most significant advancements. When trained effectively and with hallucinations kept in check, LLMs can significantly reduce human effort across a wide range of tasks.
Among these advancements, LLMs based on Retrieval Augmented Generation (RAG) are particularly noteworthy, as they introduce new knowledge into the model and help reduce hallucinations. The architecture of RAG is illustrated in the image below:
Starting with the user query in the top right, the system retrieves relevant documents using a retrieval system (a combination of an Embedding Model and a Vector Database). The LLM then reads both the Retrieved Contexts and the Query to generate a Response.
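As a toy illustration of that flow (not a real implementation: retrieval is faked with keyword overlap and the LLM call is a stub), the pipeline looks roughly like this:
# Toy sketch of the RAG flow: retrieve contexts for the query, then have the
# "LLM" answer from those contexts. Both steps are stubbed out for brevity.

DOCUMENTS = [
    "Retrieval Augmented Generation pairs a retriever with an LLM.",
    "A vector database stores document embeddings for similarity search.",
    "BLEU and ROUGE are traditional text-generation metrics.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    # Stand-in for "embed the query and search a vector database".
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(DOCUMENTS, key=overlap, reverse=True)[:top_k]

def generate(query: str, contexts: list[str]) -> str:
    # Stand-in for the LLM reading the retrieved contexts plus the query.
    prompt = "Contexts:\n" + "\n".join(contexts) + "\n\nQuestion: " + query
    return "(answer generated from a prompt of %d characters)" % len(prompt)

query = "What is Retrieval Augmented Generation?"
print(generate(query, retrieve(query)))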
This is a highly useful system, and it has given rise to several enhanced architectures, such as RA-DIT and Self-RAG. However, how can we determine which model is best suited for RAG? And which system architecture is most compatible with the current LLM?
In other words, we need to 'evaluate' the effectiveness of RAG.
However, as is widely known, quantifying and evaluating generated text is far more challenging than evaluating traditional classification tasks. Even metrics like BLEU and ROUGE-N are insufficient for assessing the answers generated by today's LLMs.
After going through various Medium articles, blog posts, and research papers, I recently came across a paper that my colleague recommended, which I found quite relevant: RAGAS: Automated Evaluation of Retrieval Augmented Generation.
Introducing RAGAS
RAGAS (Retrieval Augmented Generation Assessment) is an automated evaluation framework that is domain-agnostic. You can find the related code on the official GitHub page, and it has been integrated into well-known LLM pipeline frameworks like llama-index and LangChain.
The RAGAS framework uses an LLM to evaluate the performance of the current RAG system. To be precise, the default evaluation model is gpt-3.5-turbo-16k, which assesses different aspects of the RAG system.
Of course, this evaluation framework allows us to freely replace the evaluation model (note that this model is different from the one integrated into the RAG system). For example, you can switch to GPT-4 or a custom LLM that suits your needs.
Evaluation Strategies
The research team considered the standard RAG setup: Given a question q, the system retrieves relevant information c(q), which the LLM uses to generate an answer a(q).
In this particular task of retrieval-based answer generation, there usually isn’t a human-annotated dataset available. Manually evaluating RAG-based LLMs is too time-consuming (each time a RAG system setting is adjusted, the entire system needs to be re-evaluated). Therefore, the team chose to use an LLM for evaluation (the current standard is to use GPT-4).
But how do we evaluate LLMs with LLMs? It’s well known that LLMs aren’t particularly sensitive to numbers, and they can exhibit a degree of hallucination and bias. So, it’s necessary to set evaluation targets manually for the LLM.
In the RAGAS paper, three evaluation dimensions are defined:
1. Faithfulness
2. Answer Relevance
3. Context Relevance
The official framework documentation also mentions two additional dimensions:
4. Context Precision
5. Context Recall
Let’s go through each of these dimensions and what they mean. Note that when I refer to the evaluation LLM, I’m talking about the LLM that scores the RAG system, not the one integrated into the RAG system.
Faithfulness
This metric quantifies the factual consistency between the 'generated response' and the 'retrieved content', with the score mapped to the range (0, 1). Naturally, the higher the score, the better.
First, we give the evaluation LLM a prompt (which you can customize according to your evaluation needs) to break the RAG system's generated response down into a set of statements.
Given a question and answer, create one or more statements from each sentence in the given answer.
question: [question]
answer: [answer]
After extracting the statements, we have the evaluation LLM assess whether each one can be inferred from the retrieved content.
Consider the given context and following statements, then determine whether they are supported by the information present in the context. Provide a final verdict for each statement in order at the end in the given format. Do not deviate from the specified format.
statement: [statement 1]
...
statement: [statement n]
The final faithfulness score is calculated as 'the number of statements that can be inferred from the contexts / the total number of statements'.
For example, if your evaluation result is as follows:
CONTEXT: A = 1, B = 2, A + B = 3.
ANSWER: A = 1, B = 2, C = 3, A + B = 3, A + C = 4.
STATEMENTS:
O | A = 1
O | B = 2
X | C = 3
O | A + B = 3
X | A + C = 4
Then your RAG system has significant hallucination issues, and the resulting faithfulness score would be 3/5, or 0.6.
Conversely, if your evaluation results are excellent:
CONTEXT: A = 1, B = 2, A + B = 3.
ANSWER: Because A = 1 and B = 2, A + B = 3.
STATEMENTS:
O | A = 1
O | B = 2
O | A + B = 3
Then the score would naturally be 3/3 = 1.0, a perfect score.
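To make the arithmetic explicit, here is a minimal sketch of the final scoring step, assuming the supported/unsupported verdicts for each statement have already been collected from the evaluation LLM (the two lists mirror the examples above):
def faithfulness_score(verdicts: list[bool]) -> float:
    # Faithfulness = statements supported by the context / total statements.
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)

# First example: C = 3 and A + C = 4 are not supported by the context.
print(faithfulness_score([True, True, False, True, False]))  # 0.6

# Second example: every statement is supported.
print(faithfulness_score([True, True, True]))  # 1.0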
Answer Relevance
For answer relevance, the evaluation LLM generates a 'new question' based on the answer produced by the RAG system, and we then calculate the similarity between this new question and the original user question.
In the paper, the similarity calculation is performed using OpenAI's text-embedding-ada-002. You can generate multiple questions to calculate an average score.
We can think of it this way: if the RAG-generated answer cannot adequately address the original user query, the newly generated question will likely be quite different from the original user question in the vector space.
So this is indeed a viable way to evaluate whether the RAG-generated response is relevant to the user’s question.
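As a rough sketch of the idea (not the framework's exact implementation), suppose we already have embedding vectors for the original question and for the questions the evaluation LLM regenerated from the answer; the score is then just their average cosine similarity. The toy vectors below stand in for real text-embedding-ada-002 embeddings:
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer_relevance(original_question_vec: np.ndarray,
                     regenerated_question_vecs: list[np.ndarray]) -> float:
    # Average similarity between the original question and the questions
    # the evaluation LLM regenerated from the RAG system's answer.
    sims = [cosine_similarity(original_question_vec, v) for v in regenerated_question_vecs]
    return float(np.mean(sims))

# Toy vectors standing in for real embeddings.
original = np.array([0.9, 0.1, 0.0])
regenerated = [np.array([0.8, 0.2, 0.1]), np.array([0.7, 0.3, 0.0])]
print(round(answer_relevance(original, regenerated), 3))  # close to 1.0 -> relevant answer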
Context Relevance
Similar to the faithfulness metric, this time, we break down the retrieved contexts into sentences. The evaluation LLM then determines whether each sentence can help answer the user’s question.
Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information". While extracting candidate sentences you're not allowed to make any changes to sentences from given context.
As before, we can calculate the context relevance score by dividing the number of question-relevant sentences by the total number of sentences in the context.
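In code, the scoring step is just a ratio of counts; the sentence splitting and the relevance judgments are assumed to come from the prompt above, so this sketch simply takes them as inputs:
def context_relevance(relevant_sentences: list[str], context_sentences: list[str]) -> float:
    # Sentences the evaluation LLM judged helpful for answering the question,
    # divided by all sentences in the retrieved context.
    # An "Insufficient Information" verdict means zero relevant sentences.
    if not context_sentences:
        return 0.0
    return len(relevant_sentences) / len(context_sentences)

context = ["A = 1.", "B = 2.", "The weather is nice today."]
relevant = ["A = 1.", "B = 2."]
print(round(context_relevance(relevant, context), 2))  # 0.67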
Context Precision
The following two metrics require a Ground Truth to calculate. If you want to completely eliminate manual involvement, these metrics might not be applicable since they require manual creation of Ground Truth.
The variable k represents the top-k retrieved contexts. The idea is to average precision@k over the ranks that actually hold a relevant context: for each position k, precision@k (true positives among the first k results divided by k) is counted only when the context at rank k is itself relevant, and the sum is divided by the total number of relevant contexts retrieved. True positive and false positive can be understood as 'when I predict 1, the answer is indeed 1' and 'when I predict 1, the answer is actually 0,' respectively.
Looking at the source code:
denominator = sum(response) + 1e-10
numerator = sum(
    [
        (sum(response[: i + 1]) / (i + 1)) * response[i]
        for i in range(len(response))
    ]
)
scores.append(numerator / denominator)
Here, response is a list of the evaluation LLM's judgments of whether each retrieved context is relevant to the question, in rank order. Suppose we have two different retrieval systems, and their evaluated context lists are [1, 1, 0] and [1, 0, 1], respectively.
When calculating the denominator, both scores would be 2 + 1e-10 (the tiny 1e-10 term simply prevents division by zero when no context is judged relevant).
However, when calculating the numerator, the scores differ: the former's numerator is 1 + 1 = 2.0, while the latter's is 1 + 2/3 ≈ 1.67, giving final scores of roughly 1.0 and 0.83. Intuitively, we would prefer the retrieval system that ranks the relevant contexts higher.
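Wrapping the source-code logic in a function and running it on the two verdict lists makes the difference concrete:
def context_precision(response: list[int]) -> float:
    # response holds one 0/1 relevance verdict per retrieved context, in rank order.
    denominator = sum(response) + 1e-10
    numerator = sum(
        (sum(response[: i + 1]) / (i + 1)) * response[i]
        for i in range(len(response))
    )
    return numerator / denominator

print(round(context_precision([1, 1, 0]), 2))  # 1.0  -> relevant contexts ranked first
print(round(context_precision([1, 0, 1]), 2))  # 0.83 -> a relevant context pushed to rank 3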
Context Recall
Recall requires a Ground Truth. The calculation works by counting the sentences in the Ground Truth that can be attributed to the retrieved contexts and dividing that count by the total number of sentences in the Ground Truth. This assesses how well the retrieved contexts cover the Ground Truth.
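A minimal sketch of the calculation, assuming the evaluation LLM has already marked which Ground Truth sentences are covered by the retrieved contexts:
def context_recall(covered_flags: list[bool]) -> float:
    # One flag per Ground Truth sentence: True if that sentence can be
    # attributed to (is covered by) the retrieved contexts.
    if not covered_flags:
        return 0.0
    return sum(covered_flags) / len(covered_flags)

# 3 of the 4 Ground Truth sentences are covered by the retrieved contexts.
print(context_recall([True, True, True, False]))  # 0.75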
How to Use
Using the ragas framework is straightforward. First, install the package.
pip install ragas
Next, set up your OpenAI API key and prepare your own dataset (here’s a sample dataset).
import os
os.environ["OPENAI_API_KEY"] = "your-openai-key"
from datasets import load_dataset
fiqa_eval = load_dataset("explodinggradients/fiqa", "ragas_eval")
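Continuing from the snippet above, you can sanity-check the dataset before evaluating by inspecting its splits and columns (the ragas metrics expect fields such as the question, the generated answer, the retrieved contexts, and, for the Ground Truth based metrics, the reference answers):
# Inspect the splits and columns of the loaded evaluation dataset.
print(fiqa_eval)
print(fiqa_eval["baseline"].column_names)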
Then, select the evaluation methods you want to use.
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_recall,
    context_precision,
)
from ragas import evaluate

result = evaluate(
    fiqa_eval["baseline"].select(range(3)),  # selecting only 3
    metrics=[
        context_precision,
        faithfulness,
        answer_relevancy,
        context_recall,
    ],
)

df = result.to_pandas()
df.head()
Since the evaluation framework’s prompts ultimately need to align with your specific needs, I still recommend implementing your own evaluation methods when possible.