RAG Evaluation Toolkit on an IPCC Climate Agent

Install dependencies and download the IPCC report

[ ]:

!pip install "giskard[llm]" --upgrade
!pip install llama-index PyMuPDF

[ ]:

!wget "https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf" -O "ipcc_report.pdf"

Generating the test set can take several minutes. If you want to try RAGET on the IPCC report without waiting, we prepared a demo test set along with the evaluation report of the RAG agent built in this notebook. Both can be downloaded with the following command:

[ ]:

!git clone https://github.com/Giskard-AI/raget_demo.git

Build RAG Agent on the IPCC report

[1]:

import pandas as pd
import warnings
pd.set_option("display.max_colwidth", 400)
warnings.filterwarnings('ignore')

[2]:

from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.base.llms.types import ChatMessage, MessageRole

loader = PyMuPDFReader()
ipcc_documents = loader.load(file_path="./ipcc_report.pdf")

[3]:

splitter = SentenceSplitter(chunk_size=512)
index = VectorStoreIndex.from_documents(ipcc_documents, transformations=[splitter])
chat_engine = index.as_chat_engine()


Let’s test the Agent

[4]:

str(chat_engine.chat("How much will the global temperature rise by 2100?"))

[4]:

'The projected global temperature increase by the year 2100 is 3.2 degrees Celsius, with a range of 2.2 to 3.5 degrees Celsius.'

Generate a test set on the IPCC report

[5]:

from giskard.rag import KnowledgeBase, generate_testset, QATestset

text_nodes = splitter(ipcc_documents)
knowledge_base_df = pd.DataFrame([node.text for node in text_nodes], columns=["text"])
knowledge_base = KnowledgeBase(knowledge_base_df)

[ ]:

testset = generate_testset(knowledge_base,
                           num_questions=120,
                           agent_description="A chatbot answering questions about the IPCC report")

[6]:

testset = QATestset.load("ipcc_testset.jsonl")

[ ]:

testset.to_pandas().head(5)

	question	reference_answer	reference_context	conversation_history	metadata
id
450623f7-e644-4bfa-88d5-90f31dd15d99	What are the consequences of global warming exceeding 2°C for climate resilient development in some regions and sub-regions?	Climate resilient development will not be possible in some regions and sub-regions if global warming exceeds 2°C.	Document 196: Accelerated and equitable mitigation and adaptation bring beneﬁts from avoiding damages from climate \nchange and are critical to achieving sustainable development (high conﬁdence). Climate resilient development138 \npathways are progressively constrained by every increment of further warming (very high conﬁdence). There is a \nrapidly closing window of opportunity to secure a li...	[]	{'question_type': 'simple', 'seed_document_id': 196, 'topic': 'Climate Change Action'}
79f98d3d-766b-4cbf-800f-03e87966e3e5	What is the projected decline in coral reefs with a global warming of 1.5°C?	Coral reefs are projected to decline by a further 70–90% at 1.5°C of global warming.	Document 123: 71\nLong-Term Climate and Development Futures\nSection 3\n3.1.2 Impacts and Related Risks\nFor a given level of warming, many climate-related risks are \nassessed to be higher than in AR5 (high conﬁdence). Levels of \nrisk120 for all Reasons for Concern121 (RFCs) are assessed to become high \nto very high at lower global warming levels compared to what was \nassessed in AR5 (high...	[]	{'question_type': 'simple', 'seed_document_id': 123, 'topic': 'Climate Change Risks'}
1ee224a2-62af-4877-b172-baec006512e6	What is the expected uncertainty range in the total potential for mitigation options according to the IPCC report?	The uncertainty in the total potential is typically 25–50%.	Document 251: Where a gradual colour transition is shown, the breakdown of the potential into cost categories is not well known or depends heavily on factors such \nas geographical location, resource availability, and regional circumstances, and the colours indicate the range of estimates. The uncertainty in the total potential is typically 25–50%. \nWhen interpreting this ﬁgure, the following...	[]	{'question_type': 'simple', 'seed_document_id': 251, 'topic': 'Climate Change Action'}
16264bd2-510a-4368-a9d6-0a5fef7feb65	What is the effect of increasing cumulative net CO2 emissions on the effectiveness of natural land and ocean carbon sinks?	The proportion of emissions taken up by land and ocean decreases with increasing cumulative net CO2 emissions.	Document 166: While \nnatural land and ocean carbon sinks are projected to take up, in absolute \nterms, a progressively larger amount of CO2 under higher compared to \nlower CO2 emissions scenarios, they become less effective, that is, the \nproportion of emissions taken up by land and ocean decreases with \nincreasing cumulative net CO2 emissions (high conﬁdence). Additional \necosystem resp...	[]	{'question_type': 'simple', 'seed_document_id': 166, 'topic': 'Climate Change Projections'}
c31c6857-c505-45ef-98e5-aa524c4b05e7	What does hatching represent on the maps depicting changes in maize yield and fisheries catch potential?	Hatching indicates areas where less than 70% of the climate-crop model combinations agree on the sign of impact for maize yield, and where the two climate-fisheries models disagree in the direction of change for fisheries catch potential.	Document 135: Interquartile ranges of WGLs by 2081–2100 \nunder RCP2.6, RCP4.5 and RCP8.5. The presented index is consistent with common features found in many indices included within WGI and WGII assessments. (c) Impacts \non food production: (c1) Changes in maize yield at projected GWLs of 1.6°C to 2.4°C (2.0°C), 3.3°C to 4.8°C (4.1°C) and 3.9°C to 6.0°C (4.9°C). Median yield changes \nfrom ...	[]	{'question_type': 'simple', 'seed_document_id': 135, 'topic': 'Climate Change Assessment'}
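
Each generated question carries a `metadata` dictionary with its question type, seed document and topic. As a minimal sketch, the snippet below tallies the metadata of the five rows previewed above. The names `preview_metadata` and `tally` are hypothetical helpers introduced for illustration, not part of Giskard. On the full test set the same tally could presumably be run over the `metadata` column of `testset.to_pandas()`, assuming it holds dictionaries shaped like the ones displayed.

```python
from collections import Counter

# Metadata of the five previewed rows, copied from the output above.
preview_metadata = [
    {"question_type": "simple", "seed_document_id": 196, "topic": "Climate Change Action"},
    {"question_type": "simple", "seed_document_id": 123, "topic": "Climate Change Risks"},
    {"question_type": "simple", "seed_document_id": 251, "topic": "Climate Change Action"},
    {"question_type": "simple", "seed_document_id": 166, "topic": "Climate Change Projections"},
    {"question_type": "simple", "seed_document_id": 135, "topic": "Climate Change Assessment"},
]


def tally(records, key):
    """Count how many records share each value of `key` (hypothetical helper)."""
    return Counter(record[key] for record in records)


type_counts = tally(preview_metadata, "question_type")
topic_counts = tally(preview_metadata, "topic")

print(type_counts.most_common())
print(topic_counts.most_common())
```

A tally like this is a quick check that the test set covers the question types and topics you care about before running the evaluation.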

Evaluate and Diagnose the Agent

[8]:

from giskard.rag import evaluate, RAGReport
from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_context_precision

[9]:

def answer_fn(question, history=None):
    # Convert the conversation history (a list of {"role", "content"} dicts)
    # into llama-index chat messages; an empty or missing history yields [].
    chat_history = [
        ChatMessage(
            role=MessageRole.USER if msg["role"] == "user" else MessageRole.ASSISTANT,
            content=msg["content"],
        )
        for msg in (history or [])
    ]
    answer = chat_engine.chat(question, chat_history=chat_history)
    return str(answer)

report = evaluate(answer_fn,
                testset=testset,
                knowledge_base=knowledge_base,
                metrics=[ragas_context_recall, ragas_context_precision])

[9]:

report = RAGReport.load("ipcc_report")

[10]:

display(report.to_html(embed=True))

[11]:

report.save("ipcc_report")

RAGET question types

Each question type assesses a few RAG components. This makes it possible to localize weaknesses in the RAG Agent and give feedback to the developers.

Question type	Description	Example	Targeted RAG components
Simple	Simple questions generated from an excerpt of the knowledge base	How much will the global temperature rise by 2100?	`Generator`, `Retriever`
Complex	Questions made more complex by paraphrasing	How much will the temperature rise in a century?	`Generator`
Distracting	Questions made to confuse the retrieval part of the RAG with a distracting element that comes from the knowledge base but is irrelevant to the question	Renewable energy is cheaper, but how much will the global temperature rise by 2100?	`Generator`, `Retriever`, `Rewriter`
Situational	Questions including user context to evaluate the ability of the generation to produce a relevant answer according to the context	I want to take personal actions to reduce my carbon footprint and I wonder how much the global temperature will rise by 2100?	`Generator`
Double	Questions with two distinct parts to evaluate the capabilities of the query rewriter of the RAG	How much will the global temperature rise by 2100 and what is the main source of greenhouse gases?	`Generator`, `Rewriter`
Conversational	Questions asked as part of a conversation: the first message describes the context of the question that is asked in the last message. Also tests the rewriter	I want to know more about the global temperature evolution by 2100. - How high will it be?	`Rewriter`, `Routing`
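
To illustrate how the mapping in the table can localize weaknesses, here is a minimal sketch that turns per-question-type correctness into a rough per-component score. The aggregation rule (a plain average over the question types targeting a component) is an assumption made for illustration and is not necessarily how the Giskard report computes its component scores. The function `component_scores` and the correctness values are hypothetical.

```python
from collections import defaultdict

# Question type -> targeted RAG components, copied from the table above.
TARGETED_COMPONENTS = {
    "simple": ["Generator", "Retriever"],
    "complex": ["Generator"],
    "distracting": ["Generator", "Retriever", "Rewriter"],
    "situational": ["Generator"],
    "double": ["Generator", "Rewriter"],
    "conversational": ["Rewriter", "Routing"],
}


def component_scores(correctness_by_type):
    """Average correctness over the question types that target each component.

    Assumption: a simple unweighted mean, chosen only for illustration.
    """
    collected = defaultdict(list)
    for question_type, correctness in correctness_by_type.items():
        for component in TARGETED_COMPONENTS[question_type]:
            collected[component].append(correctness)
    return {name: sum(values) / len(values) for name, values in collected.items()}


# Hypothetical correctness rates, NOT results produced by this notebook.
example_correctness = {
    "simple": 0.9,
    "complex": 0.8,
    "distracting": 0.5,
    "situational": 0.8,
    "double": 0.6,
    "conversational": 0.4,
}

scores = component_scores(example_correctness)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.2f}")
```

With these made-up numbers the lowest scores fall on `Routing` and `Rewriter`, which would suggest looking at how the agent handles conversation history and query rewriting before touching the generator.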

Upload the test set on Giskard Hub

[26]:

from giskard import GiskardClient
import os

client = GiskardClient(
    url="http://localhost:19000",  # URL of your local Giskard instance
    key=os.getenv("GSK_HUB_KEY"),
)

suite = testset.to_test_suite(name="IPCC Test Suite")
suite.upload(client, project_key="ipcc")

Dataset successfully uploaded to project key 'ipcc' with ID = fb0fb983-91c8-4555-9e0e-cb6a34e2fcd2
Test suite has been saved: http://localhost:19000/main/projects/ipcc/testing/suite/365

[26]:

<giskard.core.suite.Suite at 0x298fff970>