
RAG Evaluation Toolkit on an IPCC Climate Agent

Giskard is an open-source framework for testing all ML models, from LLMs to tabular models. Don’t hesitate to give the project a star on GitHub ⭐️ if you find it useful!

In this notebook, you’ll learn how to create a test dataset for a RAG pipeline and use this dataset to test the model.

In this example, we use the OpenAI client, which is the default one. Giskard also supports a variety of other language models. For details on configuring different models, visit our 🤖 Setting up the LLM Client page.

In this tutorial, we will use the Giskard RAG Evaluation Toolkit (RAGET) to automatically detect issues in a Retrieval Augmented Generation (RAG) pipeline. We will test a model that answers questions about the IPCC report.

Use-case:

  • QA over the IPCC report

  • Foundational model: gpt-3.5-turbo

  • Context: IPCC report

Outline:

  • Create a test dataset for the RAG pipeline

  • Automatically evaluate the RAG pipeline and provide a report with recommendations

Install dependencies and set up the notebook

Let’s install the required dependencies. We will be using giskard[llm] to create the test dataset and llama-index to build the RAG pipeline. Additionally, we will use PyMuPDF to load the IPCC report.

[ ]:
!pip install "giskard[llm]" --upgrade
!pip install llama-index PyMuPDF

Next, we download the IPCC report and save it as ipcc_report.pdf.

[ ]:
!wget "https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf" -O "ipcc_report.pdf"

Now, we can import all of the required libraries and classes.

[ ]:
import os
import warnings

import openai
import pandas as pd
from llama_index.core import VectorStoreIndex
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader

from giskard.rag import AgentAnswer, KnowledgeBase, QATestset, RAGReport, evaluate, generate_testset
from giskard.rag.metrics.ragas_metrics import ragas_context_precision, ragas_context_recall

Now, let’s set the OpenAI API Key environment variable and some visual options.

[ ]:
# Set the OpenAI API Key environment variable.
OPENAI_API_KEY = "..."
openai.api_key = OPENAI_API_KEY
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Set pandas options
pd.set_option("display.max_colwidth", 400)
warnings.filterwarnings("ignore")
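If you prefer not to paste the key into the notebook, one option is to read it from an environment variable that was set before starting Jupyter. The sketch below is only an illustration: require_env is a hypothetical helper, not part of Giskard or the OpenAI client.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message.

    Hypothetical helper for illustration; not part of Giskard or openai.
    """
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Possible usage in place of the hard-coded assignment above:
# OPENAI_API_KEY = require_env("OPENAI_API_KEY")
```

Failing early with an explicit error tends to be easier to debug than an authentication error raised later, in the middle of test set generation.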

Build RAG Agent on the IPCC report

We will use llama-index to build the RAG pipeline. A VectorStoreIndex indexes the IPCC report, and its as_chat_engine method creates a chat engine on top of that index.

First, we load the IPCC report with PyMuPDFReader. We then use SentenceSplitter to split the report into chunks of at most 512 tokens, so that the retrieved context does not get too large, and build the VectorStoreIndex from these chunks.

[2]:
loader = PyMuPDFReader()
ipcc_documents = loader.load(file_path="./ipcc_report.pdf")
splitter = SentenceSplitter(chunk_size=512)
index = VectorStoreIndex.from_documents(ipcc_documents, transformations=[splitter])
chat_engine = index.as_chat_engine()
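To give an intuition for what sentence-based chunking does, here is a minimal sketch that greedily packs whole sentences into chunks under a size limit. It is not llama-index's implementation: split_into_chunks is a hypothetical function, tokens are approximated by whitespace-separated words, and chunk overlap is ignored.

```python
import re


def split_into_chunks(text: str, chunk_size: int = 512) -> list[str]:
    """Greedily pack whole sentences into chunks of at most `chunk_size` tokens.

    Assumption: a "token" is a whitespace-separated word. A real splitter
    would use a proper tokenizer. A single sentence longer than `chunk_size`
    is kept whole as its own (oversized) chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_tokens = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and current_len + n_tokens > chunk_size:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy text, not taken from the report.
sample = "Warming is unequivocal. Emissions continue to rise. Adaptation has limits."
small_chunks = split_into_chunks(sample, chunk_size=6)
larger_chunks = split_into_chunks(sample, chunk_size=7)
```

With a limit of 6 words each sentence ends up in its own chunk, while a limit of 7 lets the first two sentences share a chunk.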

The image below shows a typical Retrieval Augmented Generation (RAG) pipeline. We ask a question, the most relevant context is retrieved from the vector store, and this context is then passed to the model so that it generates an answer grounded in relevant and up-to-date knowledge.

[Figure: diagram of a RAG pipeline]
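The retrieve-then-generate flow can be sketched in a few lines of plain Python. This is a toy illustration under strong assumptions: retrieval uses bag-of-words cosine similarity instead of embeddings, the chunks are made-up strings rather than report text, and the final LLM call is left out. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the `top_k` chunks most similar to the question."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Assemble the prompt that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"


# Toy knowledge base (illustrative strings, not actual report text).
chunks = [
    "Global warming is likely to reach 1.5 degrees in the first half of the 2030s.",
    "Public finance can scale up adaptation in developing regions.",
    "Many changes in ice sheets and sea level are irreversible.",
]
question = "When will global warming reach 1.5 degrees?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
```

In the actual pipeline, llama-index handles both steps: the VectorStoreIndex performs the retrieval and the chat engine sends the assembled prompt to the model.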

Let’s test the Agent

We can now simply chat with our agent by calling the chat method of the chat_engine. Under the hood, this uses the VectorStoreIndex to retrieve the most relevant chunks of the report and the gpt-3.5-turbo model to answer the question.

[4]:
str(chat_engine.chat("How much will the global temperature rise by 2100?"))
[4]:
'The predicted global temperature rise by 2100 is 3.2 degrees Celsius, with a range of 2.2 to 3.5 degrees Celsius.'

Generate a test set on the IPCC report

Our agent is now ready to be tested. We can generate a test set with the generate_testset function, but first we need to create a Giskard KnowledgeBase object from the chunks of the ipcc_documents that we loaded earlier. We collect the text of each chunk in the knowledge_base_df dataframe and then create the KnowledgeBase from it.

[ ]:
text_nodes = splitter(ipcc_documents)
knowledge_base_df = pd.DataFrame([node.text for node in text_nodes], columns=["text"])
knowledge_base = KnowledgeBase(knowledge_base_df)

This KnowledgeBase is then used to generate the test set. We will generate 120 questions and pass an agent_description that describes the agent, which helps the generator produce questions suited to the agent’s use case.

[ ]:
testset = generate_testset(
    knowledge_base, num_questions=120, agent_description="A chatbot answering questions about the IPCC report"
)

To avoid losing the test set, we can save it to a JSONL file and load it again later. Note that the documents in the KnowledgeBase must be the same as the ones used to generate the test set in order to evaluate the agent’s performance on it.

[7]:
# Save the testset
testset.save("ipcc_testset.jsonl")

# Load the testset
testset = QATestset.load("ipcc_testset.jsonl")
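JSONL simply means one JSON object per line, which makes the file easy to inspect or filter with standard tools. The sketch below shows the round trip with the standard library only. The record fields are illustrative assumptions and do not claim to match the exact schema that testset.save writes.

```python
import json
import os
import tempfile

# Illustrative records; field names and values are assumptions.
records = [
    {"id": "q1", "question": "Example question 1?", "metadata": {"question_type": "simple"}},
    {"id": "q2", "question": "Example question 2?", "metadata": {"question_type": "complex"}},
]


def save_jsonl(rows: list[dict], path: str) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


def load_jsonl(path: str) -> list[dict]:
    """Read the file back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_testset.jsonl")
    save_jsonl(records, path)
    restored = load_jsonl(path)
```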

Let’s take a look at the first 5 questions in the test set. Each row contains a question, a reference answer, the reference context the question was generated from, the conversation history, and metadata such as the question type and topic.

[8]:
testset.to_pandas().head(5)
[8]:
| id | question | reference_answer | reference_context | conversation_history | metadata |
|---|---|---|---|---|---|
| 1cacb231-b6e3-44aa-a315-79aa43cff369 | When is the best estimate of reaching 1.5°C of global warming according to most scenarios? | The best estimate of reaching 1.5°C of global warming lies in the first half of the 2030s in most of the considered scenarios and modelled pathways. | Document 116: The best estimate of reaching 1.5°C of global \nwarming lies in the first half of the 2030s in most of the considered \nscenarios and modelled pathways114. In the very low GHG emissions \nscenario (SSP1-1.9), CO2 emissions reach net zero around 2050 and the \nbest-estimate end-of-century warming is 1.4°C, after a temporary overshoot \n(see Section 3.3.4) of no more than 0.1°C abov... | [] | {'question_type': 'simple', 'seed_document_id': 116, 'topic': 'Climate Change Mitigation Scenarios'} |
| d785c257-4c44-443c-99dd-7d72a296da9f | What are the projected global emissions for 2030 based on policies implemented by the end of 2020? | The median projected global emissions for 2030 based on policies implemented by the end of 2020 are 57 GtCO2-eq/yr, with a range of 52–60 GtCO2-eq/yr. | Document 82: Emissions projections for 2030 and gross differences in emissions are based on emissions of 52–56 GtCO2-eq yr–1 in 2019 as assumed in underlying model \nstudies97. (medium confidence) {WGIII Table SPM.1} (Table 3.1, Cross-Section Box.2) \n95 \nAbatement here refers to human interventions that reduce the amount of GHGs that are released from fossil fuel infrastructure to the atmosph... | [] | {'question_type': 'simple', 'seed_document_id': 82, 'topic': 'Global Greenhouse Gas Emissions and Climate Policy'} |
| 0646700a-9617-4dad-9a12-f84a6048ca9d | What are some key barriers to the implementation of adaptation options in vulnerable sectors? | Key barriers include limited resources, lack of private-sector and civic engagement, insufficient mobilisation of finance, lack of political commitment, limited research and/or slow and low uptake of adaptation science, and a low sense of urgency. | Document 95: 62\nSection 2\nSection 1\nSection 2\nfire-adapted ecosystems, or hard defences against flooding) and human \nsettlements (e.g. stranded assets and vulnerable communities that \ncannot afford to shift away or adapt and require an increase in social \nsafety nets). Maladaptation especially affects marginalised and vulnerable \ngroups adversely (e.g., Indigenous Peoples, ethnic minorit... | [] | {'question_type': 'simple', 'seed_document_id': 95, 'topic': 'Others'} |
| 0d78955c-f9c8-41ad-9ba4-a2670da4e63c | What are some irreversible changes projected due to continued GHG emissions? | Many changes due to past and future GHG emissions are irreversible on centennial to millennial time scales, especially in the ocean, ice sheets, and global sea level. | Document 118: {WGI SPM D.1.7, WGI Box TS.7} (Cross-Section Box.2)\nContinued GHG emissions will further affect all major climate \nsystem components, and many changes will be irreversible on \ncentennial to millennial time scales. Many changes in the climate \nsystem become larger in direct relation to increasing global warming. \nWith every additional increment of global warming, changes in \... | [] | {'question_type': 'simple', 'seed_document_id': 118, 'topic': 'Others'} |
| 00d34731-7f09-446d-80fb-40c0b20b547a | What are some options for scaling up mitigation and adaptation in developing regions according to the context? | Options include increased levels of public finance and publicly mobilised private finance flows from developed to developing countries, increasing the use of public guarantees to reduce risks and leverage private flows at lower cost, local capital markets development, and building greater trust in international cooperation processes. | Document 291: Accelerated support \nfrom developed countries and multilateral institutions is a critical \nenabler to enhance mitigation and adaptation action and can address \ninequities in finance, including its costs, terms and conditions, and \neconomic vulnerability to climate change. Scaled-up public grants for \nmitigation and adaptation funding for vulnerable regions, e.g., in Sub-\nSah... | [] | {'question_type': 'simple', 'seed_document_id': 291, 'topic': 'Climate Change Mitigation and Adaptation'} |
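A quick way to check how the questions are distributed is to count the question types and topics found in the metadata column. The sketch below does this for the five metadata dictionaries displayed above, using only the standard library. Applying the same counting to the metadata column of testset.to_pandas() should give the distribution for the full test set, though that is left as an assumption here.

```python
from collections import Counter

# Metadata values copied from the five rows displayed above.
metadata_rows = [
    {"question_type": "simple", "seed_document_id": 116, "topic": "Climate Change Mitigation Scenarios"},
    {"question_type": "simple", "seed_document_id": 82, "topic": "Global Greenhouse Gas Emissions and Climate Policy"},
    {"question_type": "simple", "seed_document_id": 95, "topic": "Others"},
    {"question_type": "simple", "seed_document_id": 118, "topic": "Others"},
    {"question_type": "simple", "seed_document_id": 291, "topic": "Climate Change Mitigation and Adaptation"},
]

# Count how many questions fall into each type and each topic.
type_counts = Counter(row["question_type"] for row in metadata_rows)
topic_counts = Counter(row["topic"] for row in metadata_rows)

for topic, count in topic_counts.most_common():
    print(f"{count} question(s) on topic: {topic}")
```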

Evaluate and Diagnose the Agent

We can now evaluate the agent on the test set using the RAG Evaluation Toolkit (RAGET). The evaluate function takes a function that answers a question (answer_fn below), the test set, and the knowledge base. We also pass the ragas_context_recall and ragas_context_precision metrics. The result is a RAGReport object that summarizes the agent’s performance.

[ ]:
def answer_fn(question: str, history: list[dict] = None) -> AgentAnswer:
    if history:
        answer = chat_engine.chat(
            question,
            chat_history=[
                ChatMessage(
                    role=MessageRole.USER if msg["role"] == "user" else MessageRole.ASSISTANT, content=msg["content"]
                )
                for msg in history
            ],
        )
    else:
        answer = chat_engine.chat(question, chat_history=[])

    return AgentAnswer(message=answer.response, documents=[source.content for source in answer.sources])


report = evaluate(
    answer_fn, testset=testset, knowledge_base=knowledge_base, metrics=[ragas_context_recall, ragas_context_precision]
)
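As an intuition for the two context metrics, the sketch below computes a simplified, set-based precision and recall over retrieved chunks. This is not how the ragas metrics are computed: they rely on an LLM judge rather than exact matching. The function names and chunk identifiers here are hypothetical.

```python
def simple_context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant to the question."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)


def simple_context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were actually retrieved."""
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved)) / len(relevant)


# Hypothetical chunk identifiers for a single question.
retrieved = ["chunk_116", "chunk_82", "chunk_95"]
relevant = {"chunk_116", "chunk_118"}

precision = simple_context_precision(retrieved, relevant)  # 1 of 3 retrieved chunks is relevant
recall = simple_context_recall(retrieved, relevant)        # 1 of 2 relevant chunks was retrieved
```

Low precision suggests the retriever returns a lot of noise, while low recall suggests it misses the passages needed to answer the question.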

Now, we can save the report and load it later to display it.

[11]:
# Save the RAG report
report.save("ipcc_report")

# Load the RAG report
report = RAGReport.load("ipcc_report")

We can display the report directly in the notebook. The HTML version can also be shared with others to get their feedback.

[12]:
display(report.to_html(embed=True))

RAGET question types

RAGET generates 6 different question types. Each question type targets a few RAG components, which makes it possible to localize weaknesses in the RAG agent and give feedback to the developers.

| Question type | Description | Example | Targeted RAG components |
|---|---|---|---|
| Simple | Simple questions generated from an excerpt of the knowledge base | How much will the global temperature rise by 2100? | Generator, Retriever |
| Complex | Questions made more complex by paraphrasing | How much will the temperature rise in a century? | Generator |
| Distracting | Questions made to confuse the retrieval part of the RAG with a distracting element that comes from the knowledge base but is irrelevant to the question | Renewable energy is cheaper, but how much will the global temperature rise by 2100? | Generator, Retriever, Rewriter |
| Situational | Questions including user context, to evaluate the ability of the generator to produce an answer relevant to that context | I want to take personal actions to reduce my carbon footprint and I wonder how much will the global temperature rise by 2100? | Generator |
| Double | Questions with two distinct parts, to evaluate the capabilities of the query rewriter of the RAG | How much will the global temperature rise by 2100 and what is the main source of Greenhouse Gases? | Generator, Rewriter |
| Conversational | Questions asked as part of a conversation: the first message describes the context of the question that is asked in the last message. This also tests the rewriter | (1) I want to know more about the global temperature evolution by 2100. (2) How high will it be? | Rewriter, Routing |
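The mapping from question types to components can be used to turn per-question results into a rough per-component picture. The sketch below is only an illustration of that idea and not necessarily how RAGET aggregates its scores. The lower-case keys, the component_scores function, and the example results are all hypothetical.

```python
from collections import defaultdict

# Mapping taken from the table above (keys lower-cased for convenience).
COMPONENTS_BY_TYPE = {
    "simple": ["Generator", "Retriever"],
    "complex": ["Generator"],
    "distracting": ["Generator", "Retriever", "Rewriter"],
    "situational": ["Generator"],
    "double": ["Generator", "Rewriter"],
    "conversational": ["Rewriter", "Routing"],
}


def component_scores(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Average correctness per component.

    `results` is a list of (question_type, is_correct) pairs. Each question
    counts towards every component its type targets.
    """
    totals, correct = defaultdict(int), defaultdict(int)
    for question_type, is_correct in results:
        for component in COMPONENTS_BY_TYPE[question_type]:
            totals[component] += 1
            correct[component] += int(is_correct)
    return {component: correct[component] / totals[component] for component in totals}


# Made-up evaluation results for illustration.
results = [
    ("simple", True),
    ("simple", True),
    ("double", False),
    ("conversational", True),
    ("complex", True),
]
scores = component_scores(results)
weakest = min(scores, key=scores.get)
```

In this made-up example the failed double question pulls down the Rewriter, which would be the first component to investigate.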