RAG Evaluation Toolkit on an IPCC Climate Agent
Giskard is an open-source framework for testing all ML models, from LLMs to tabular models. Don't hesitate to give the project a star on GitHub ⭐️ if you find it useful!
In this notebook, you'll learn how to create a test dataset for a RAG pipeline and use this dataset to test the model.
In this example, we illustrate the procedure using the OpenAI client, which is the default; however, please note that our platform supports a variety of language models. For details on configuring different models, visit our 🤖 Setting up the LLM Client page.
In this tutorial we will use Giskard's RAG Evaluation Toolkit (RAGET) to automatically detect issues in a Retrieval-Augmented Generation (RAG) pipeline. We will test a model that answers questions about the IPCC report.
Use-case:
QA over the IPCC report
Foundational model: gpt-3.5-turbo
Context: IPCC report
Outline:
Create a test dataset for the RAG pipeline
Automatically evaluate the RAG pipeline and provide a report with recommendations
Install dependencies and set up the notebook
Let's install the required dependencies. We will use giskard[llm] to create the test dataset and llama-index to build the RAG pipeline. Additionally, we will use PyMuPDF to load the IPCC report.
[ ]:
!pip install "giskard[llm]" --upgrade
!pip install llama-index PyMuPDF
Next, we download the IPCC report and save it as ipcc_report.pdf.
[ ]:
!wget "https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf" -O "ipcc_report.pdf"
Now, we can import all of the required libraries and classes.
[ ]:
import os
import warnings
import openai
import pandas as pd
from llama_index.core import VectorStoreIndex
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader
from giskard.rag import AgentAnswer, KnowledgeBase, QATestset, RAGReport, evaluate, generate_testset
from giskard.rag.metrics.ragas_metrics import ragas_context_precision, ragas_context_recall
Now, let's set the OpenAI API key environment variable and some display options.
[ ]:
# Set the OpenAI API Key environment variable.
OPENAI_API_KEY = "..."
openai.api_key = OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
# Set pandas options
pd.set_option("display.max_colwidth", 400)
warnings.filterwarnings("ignore")
Build a RAG Agent on the IPCC report
We will use llama-index to build the RAG pipeline: the VectorStoreIndex creates an index of the IPCC report, and the as_chat_engine method creates a chat engine from that index.
We use the PyMuPDFReader to load the IPCC report and create a VectorStoreIndex. We also use the SentenceSplitter to split the report into chunks of 512 tokens, ensuring that the retrieved context is not too large.
[2]:
loader = PyMuPDFReader()
ipcc_documents = loader.load(file_path="./ipcc_report.pdf")
splitter = SentenceSplitter(chunk_size=512)
index = VectorStoreIndex.from_documents(ipcc_documents, transformations=[splitter])
chat_engine = index.as_chat_engine()
In a typical Retrieval-Augmented Generation (RAG) pipeline, the user phrases a question, the most relevant context is retrieved from the vector store, and that context is passed to the model so that the generated answer is grounded in relevant and up-to-date knowledge.
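To make the retrieve-then-generate flow concrete, here is a minimal, self-contained sketch of it with a toy word-overlap retriever and a stubbed generation step. This is only an illustration of the pipeline's shape; the real agent uses vector embeddings for retrieval and gpt-3.5-turbo for generation.

```python
# Conceptual sketch of a RAG loop. The retriever here ranks chunks by naive
# word overlap with the question; the real pipeline uses embedding similarity.

def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks sharing the most words with the question."""
    q_words = set(question.lower().split())
    return sorted(
        chunks,
        key=lambda c: len(q_words & set(c.lower().split())),
        reverse=True,
    )[:top_k]

def generate(question: str, context: list[str]) -> str:
    """Stand-in for the LLM call: the real engine builds a prompt from the
    question plus retrieved context and sends it to the model."""
    return f"Answer to {question!r}, grounded in {len(context)} retrieved chunk(s)."

chunks = [
    "Global surface temperature is projected to rise through 2100.",
    "Sea level rise is irreversible on centennial time scales.",
    "Adaptation finance remains insufficient in vulnerable regions.",
]
context = retrieve("How much will the temperature rise by 2100?", chunks)
print(generate("How much will the temperature rise by 2100?", context))
```

The key design point is that only the retrieved chunks, not the whole corpus, are passed to the model, which keeps the prompt small and the answer grounded in the knowledge base.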
Let's test the Agent
We can now simply chat with our agent using the chat_engine and its chat method. Under the hood, this uses the VectorStoreIndex to retrieve the most relevant chunks of the report and the gpt-3.5-turbo model to answer the question.
[4]:
str(chat_engine.chat("How much will the global temperature rise by 2100?"))
[4]:
'The predicted global temperature rise by 2100 is 3.2 degrees Celsius, with a range of 2.2 to 3.5 degrees Celsius.'
Generate a test set on the IPCC report
Now our agent is ready to be tested. We can generate a test set using the generate_testset function. Before we do that, we need to create a giskard KnowledgeBase from the splits of the ipcc_documents that we loaded earlier. We assign the text of each node to the knowledge_base_df dataframe and then create a KnowledgeBase from it.
[ ]:
text_nodes = splitter(ipcc_documents)
knowledge_base_df = pd.DataFrame([node.text for node in text_nodes], columns=["text"])
knowledge_base = KnowledgeBase(knowledge_base_df)
Now, this KnowledgeBase will be used to generate the test set. We will generate 120 questions and pass an agent_description describing the agent, which helps generate questions that are representative of the agent's intended use.
[ ]:
testset = generate_testset(
knowledge_base, num_questions=120, agent_description="A chatbot answering questions about the IPCC report"
)
To avoid losing the test set, we can save it to a JSONL file and safely load it later. Note that the documents in the KnowledgeBase must be the same as the ones used to generate the testset when we later evaluate the agent's performance on this test set.
[7]:
# Save the testset
testset.save("ipcc_testset.jsonl")
# Load the testset
testset = QATestset.load("ipcc_testset.jsonl")
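JSONL stores one JSON object per line, which makes the saved test set easy to inspect or post-process outside Giskard. Below is a stdlib-only sketch of the format; the field names here are illustrative, and the exact schema of the saved file is defined by QATestset.

```python
import json

# Illustrative records -- the real file's fields are set by giskard's QATestset.
records = [
    {"question": "How much will the global temperature rise by 2100?",
     "reference_answer": "Around 3.2°C under current policies."},
    {"question": "Which changes are irreversible?",
     "reference_answer": "Ocean, ice-sheet and sea-level changes."},
]

# Write one JSON object per line.
with open("toy_testset.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")

# Read it back line by line.
with open("toy_testset.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))  # 2
```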
Let's take a look at the first 5 questions in the test set. We can see that the questions give good coverage of the topics in the IPCC report.
[8]:
testset.to_pandas().head(5)
[8]:
| id | question | reference_answer | reference_context | conversation_history | metadata |
|---|---|---|---|---|---|
| 1cacb231-b6e3-44aa-a315-79aa43cff369 | When is the best estimate of reaching 1.5°C of global warming according to most scenarios? | The best estimate of reaching 1.5°C of global warming lies in the first half of the 2030s in most of the considered scenarios and modelled pathways. | Document 116: The best estimate of reaching 1.5°C of global warming lies in the first half of the 2030s in most of the considered scenarios and modelled pathways114. In the very low GHG emissions scenario (SSP1-1.9), CO2 emissions reach net zero around 2050 and the best-estimate end-of-century warming is 1.4°C, after a temporary overshoot (see Section 3.3.4) of no more than 0.1°C abov... | [] | {'question_type': 'simple', 'seed_document_id': 116, 'topic': 'Climate Change Mitigation Scenarios'} |
| d785c257-4c44-443c-99dd-7d72a296da9f | What are the projected global emissions for 2030 based on policies implemented by the end of 2020? | The median projected global emissions for 2030 based on policies implemented by the end of 2020 are 57 GtCO2-eq/yr, with a range of 52–60 GtCO2-eq/yr. | Document 82: Emissions projections for 2030 and gross differences in emissions are based on emissions of 52–56 GtCO2-eq yr⁻¹ in 2019 as assumed in underlying model studies97. (medium confidence) {WGIII Table SPM.1} (Table 3.1, Cross-Section Box.2) 95 Abatement here refers to human interventions that reduce the amount of GHGs that are released from fossil fuel infrastructure to the atmosph... | [] | {'question_type': 'simple', 'seed_document_id': 82, 'topic': 'Global Greenhouse Gas Emissions and Climate Policy'} |
| 0646700a-9617-4dad-9a12-f84a6048ca9d | What are some key barriers to the implementation of adaptation options in vulnerable sectors? | Key barriers include limited resources, lack of private-sector and civic engagement, insufficient mobilisation of finance, lack of political commitment, limited research and/or slow and low uptake of adaptation science, and a low sense of urgency. | Document 95: 62 Section 2 Section 1 Section 2 fire-adapted ecosystems, or hard defences against flooding) and human settlements (e.g. stranded assets and vulnerable communities that cannot afford to shift away or adapt and require an increase in social safety nets). Maladaptation especially affects marginalised and vulnerable groups adversely (e.g., Indigenous Peoples, ethnic minorit... | [] | {'question_type': 'simple', 'seed_document_id': 95, 'topic': 'Others'} |
| 0d78955c-f9c8-41ad-9ba4-a2670da4e63c | What are some irreversible changes projected due to continued GHG emissions? | Many changes due to past and future GHG emissions are irreversible on centennial to millennial time scales, especially in the ocean, ice sheets, and global sea level. | Document 118: {WGI SPM D.1.7, WGI Box TS.7} (Cross-Section Box.2) Continued GHG emissions will further affect all major climate system components, and many changes will be irreversible on centennial to millennial time scales. Many changes in the climate system become larger in direct relation to increasing global warming. With every additional increment of global warming, changes in ... | [] | {'question_type': 'simple', 'seed_document_id': 118, 'topic': 'Others'} |
| 00d34731-7f09-446d-80fb-40c0b20b547a | What are some options for scaling up mitigation and adaptation in developing regions according to the context? | Options include increased levels of public finance and publicly mobilised private finance flows from developed to developing countries, increasing the use of public guarantees to reduce risks and leverage private flows at lower cost, local capital markets development, and building greater trust in international cooperation processes. | Document 291: Accelerated support from developed countries and multilateral institutions is a critical enabler to enhance mitigation and adaptation action and can address inequities in finance, including its costs, terms and conditions, and economic vulnerability to climate change. Scaled-up public grants for mitigation and adaptation funding for vulnerable regions, e.g., in Sub-Sah... | [] | {'question_type': 'simple', 'seed_document_id': 291, 'topic': 'Climate Change Mitigation and Adaptation'} |
Evaluate and Diagnose the Agent
We can now evaluate the agent on the test set using the RAG Evaluation Toolkit (RAGET). We call the evaluate function with the ragas_context_recall and ragas_context_precision metrics, and use the RAGReport class to generate a report of the agent's performance.
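Intuitively, context recall measures how much of the information needed for the reference answer was actually retrieved, while context precision measures how much of the retrieved context is relevant. The toy functions below convey that intuition with simple token overlap; they are not the real ragas implementations, which use an LLM judge rather than string matching.

```python
# Toy illustration of the intuition behind context recall and precision.
# The real ragas metrics are LLM-judged, not token-overlap based.

def toy_context_recall(reference_answer: str, retrieved: list[str]) -> float:
    """Fraction of reference-answer tokens found anywhere in the retrieved
    context (higher = retrieval covered the facts needed to answer)."""
    ref_tokens = set(reference_answer.lower().split())
    ctx_tokens = set(" ".join(retrieved).lower().split())
    return len(ref_tokens & ctx_tokens) / len(ref_tokens)

def toy_context_precision(reference_answer: str, retrieved: list[str]) -> float:
    """Fraction of retrieved chunks sharing at least one token with the
    reference answer (higher = less irrelevant context retrieved)."""
    ref_tokens = set(reference_answer.lower().split())
    relevant = [c for c in retrieved if ref_tokens & set(c.lower().split())]
    return len(relevant) / len(retrieved)

retrieved = ["temperature rises 3.2 degrees by 2100", "unrelated finance text"]
print(toy_context_recall("temperature rises 3.2 degrees", retrieved))    # 1.0
print(toy_context_precision("temperature rises 3.2 degrees", retrieved)) # 0.5
```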
[ ]:
def answer_fn(question: str, history: list[dict] | None = None) -> AgentAnswer:
    # Convert the optional conversation history into llama-index ChatMessages.
    chat_history = [
        ChatMessage(
            role=MessageRole.USER if msg["role"] == "user" else MessageRole.ASSISTANT,
            content=msg["content"],
        )
        for msg in history or []
    ]
    answer = chat_engine.chat(question, chat_history=chat_history)
    # Return both the answer and the retrieved documents so that RAGET
    # can compute the retrieval metrics.
    return AgentAnswer(message=answer.response, documents=[source.content for source in answer.sources])
report = evaluate(
answer_fn, testset=testset, knowledge_base=knowledge_base, metrics=[ragas_context_recall, ragas_context_precision]
)
Now, we can save the report and load it later to display it.
[11]:
# Save the RAG report
report.save("ipcc_report")
# Load the RAG report
report = RAGReport.load("ipcc_report")
We can also share the report with others to get their feedback.
[12]:
display(report.to_html(embed=True))
RAGET question types
RAGET generates six different question types, each assessing a few RAG components. This makes it possible to localize weaknesses in the RAG agent and give targeted feedback to the developers.
| Question type | Description | Example | Targeted RAG components |
|---|---|---|---|
| Simple | Simple questions generated from an excerpt of the knowledge base | How much will the global temperature rise by 2100? | |
| Complex | Questions made more complex by paraphrasing | How much will the temperature rise in a century? | |
| Distracting | Questions made to confuse the retrieval part of the RAG with a distracting element taken from the knowledge base but irrelevant to the question | Renewable energy is cheaper, but how much will the global temperature rise by 2100? | |
| Situational | Questions including user context, to evaluate the ability of the generation to produce an answer relevant to that context | I want to take personal actions to reduce my carbon footprint, so how much will the global temperature rise by 2100? | |
| Double | Questions with two distinct parts, to evaluate the capabilities of the query rewriter of the RAG | How much will the global temperature rise by 2100, and what is the main source of greenhouse gases? | |
| Conversational | Questions asked as part of a conversation: the first message describes the context of the question that is asked in the last message; also tests the rewriter | | |
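Since each generated test case records its question type in the metadata column (as shown in the test set preview earlier), you can check the mix of types in a generated set. A minimal stdlib sketch over hypothetical metadata dicts; with the real test set you would iterate over the metadata column of testset.to_pandas() instead:

```python
from collections import Counter

# Hypothetical metadata entries, shaped like the `metadata` column of the
# test set preview above (values here are made up for illustration).
metadata = [
    {"question_type": "simple", "seed_document_id": 116},
    {"question_type": "simple", "seed_document_id": 82},
    {"question_type": "complex", "seed_document_id": 12},
    {"question_type": "double", "seed_document_id": 7},
]

counts = Counter(m["question_type"] for m in metadata)
print(counts.most_common())
```

A skewed distribution (e.g. almost no conversational questions) would mean some RAG components, like the query rewriter, are barely exercised by the test set.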