
RAG Evaluation Toolkit on a Banking Supervisory Process Agent

Before starting

In this notebook, we build a Banking Supervisory Process Agent with llama_index and the gpt-3.5-turbo model. We then use giskard to evaluate both the underlying model and the RAG system.

To perform these evaluations, we use features such as scan, generate_testset and evaluate, which require an LLM client. By default, the client uses OpenAI’s models (e.g. gpt-4 and text-embedding-ada-002). If you want to use another provider (e.g. Ollama, Gemini, Azure) or change the models, please refer to Setting up the LLM Client for more information.
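With the default OpenAI setup, the API key is usually picked up from the OPENAI_API_KEY environment variable. A minimal sketch of setting it without hard-coding the secret in the notebook is shown here; `ensure_openai_key` is a hypothetical helper, not part of giskard or llama_index.

```python
import os
from getpass import getpass


def ensure_openai_key(prompt=getpass):
    """Return the OpenAI API key, asking for it only if it is not already set.

    `prompt` is any callable that takes a prompt string and returns the key;
    it defaults to getpass so the key is not echoed in the notebook output.
    """
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        key = prompt("OpenAI API key: ")
        # Export it so libraries that read the environment can find it.
        os.environ["OPENAI_API_KEY"] = key
    return key


# In the notebook, call this once before building the agent:
# ensure_openai_key()
```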

Install dependencies and download the Banking Supervision report

[ ]:
!pip install "giskard[llm]" --upgrade
!pip install llama-index PyMuPDF
[ ]:
!wget "https://www.bankingsupervision.europa.eu/ecb/pub/pdf/ssm.supervisory_guides202401_manual.en.pdf" -O "banking_supervision_report.pdf"

Build RAG Agent on the Banking Supervision report

[1]:
import pandas as pd
import warnings
pd.set_option("display.max_colwidth", 400)
warnings.filterwarnings('ignore')
[2]:
from llama_index.core import VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.readers.file import PyMuPDFReader
from llama_index.core.base.llms.types import ChatMessage, MessageRole
from llama_index.llms.openai import OpenAI

loader = PyMuPDFReader()
documents = loader.load(file_path="./banking_supervision_report.pdf")
llm = OpenAI(model="gpt-3.5-turbo", temperature=0)
[3]:
splitter = SentenceSplitter(chunk_size=512)
index = VectorStoreIndex.from_documents(documents, transformations=[splitter])
chat_engine = index.as_chat_engine(llm=llm)

Let’s test the Agent

[4]:
str(chat_engine.chat("What is SSM?"))
[4]:
'SSM stands for Single Supervisory Mechanism.'

Scan LLM vulnerabilities

As a first step, we will run a scan on the chatbot model. This will help us identify the potential vulnerabilities in the model that the agent is built on.

[5]:
def model_predict(df: pd.DataFrame):
    return [chat_engine.chat(question).response for question in df["question"]]
[ ]:
from giskard import Model

giskard_model = Model(
    model=model_predict,
    model_type="text_generation",
    name="Banking Supervision Question Answering",
    description="A model that answers questions about ECB Banking Supervision report",
    feature_names=["question"]
)
[ ]:
from giskard import scan

scan_report = scan(giskard_model)
[8]:
# Scan report
display(scan_report)

Generate a test set on the Banking Supervision report

[ ]:
from giskard.rag import KnowledgeBase, generate_testset, QATestset

text_nodes = splitter(documents)
knowledge_base_df = pd.DataFrame([node.text for node in text_nodes], columns=["text"])
knowledge_base = KnowledgeBase(knowledge_base_df)
[ ]:
testset = generate_testset(knowledge_base,
                           num_questions=100,
                           agent_description="A chatbot answering questions about banking supervision procedures and methodologies.",
                           language="en")
[11]:
# Save the testset
testset.save("banking_supervision_testset.jsonl")

# Load the testset
testset = QATestset.load("banking_supervision_testset.jsonl")
[12]:
testset.to_pandas().head(5)
[12]:
question reference_answer reference_context conversation_history metadata
id
35202be3-9120-4bd1-9b3b-722d3b307e1c What is the role of Joint Supervisory Teams (JSTs) in the supervision of Significant Institutions (SIs)? The day-to-day supervision of SIs is primarily conducted off-site by the JSTs, which comprise staff from NCAs and the ECB and are supported by the horizontal and specialised expertise divisions of DG/HOL and similar staff at the NCAs. The JST analyses the supervisory reporting, financial statements and internal documentation of supervised entities, holds regular and ad hoc meetings with the su... Document 76: This can involve on-site interventions at supervised institutions, if needed. \nDepending on a specific bank’s risk profile assessment, the ECB may impose a wide \nrange of supervisory measures. \n2.3.1 \nJoint Supervisory Teams \nThe day-to-day supervision of SIs is primarily conducted off-site by the JSTs, which \ncomprise staff from NCAs and the ECB and are supported by the hor... [] {'question_type': 'simple', 'seed_document_id': 76, 'topic': 'Others'}
1beb42a0-ff1a-42e9-91c6-fe11774e909d What happens if an urgent supervisory decision is necessary to prevent significant damage to the financial system? The ECB may adopt a supervisory decision which would adversely affect the rights of the addressee without giving it the opportunity to comment on the decision prior to its adoption. In this case, the hearing is postponed, and a clear justification is provided in the decision as to why the postponement is necessary. The hearing is then organised as soon as possible after the adoption of the dec... Document 34: Supervisory Manual – Functioning of the Single Supervisory Mechanism \n \n21 \nFigure 4 \nDecision-making process \n \n*The deadline for submitting comments/objections in a written procedure is five working days, while the deadline for non-objection \nprocedures is a maximum of ten working days. \n**The applicable legal deadlines for each specific case must be taken into account. ... [] {'question_type': 'simple', 'seed_document_id': 34, 'topic': 'Single Supervisory Mechanism'}
562d7352-b2ee-4191-b6eb-96f0fca7b01c What is required of banks and investment firms in the EU that are subsidiaries of third-country groups according to Article 21b of Directive 2013/36/EU? Article 21b of Directive 2013/36/EU requires banks and investment firms in the EU that are subsidiaries of third-country groups to set up a single intermediate EU parent undertaking if the third-country group has two or more institutions established within the EU with a combined total asset value of at least €40 billion. Document 169: Supervisory Manual – Supervision of significant institutions \n \n97 \ntransactions which go beyond the contractual obligations of a sponsor institution or \nan originator institution under Article 248(1) of Regulation (EU) No 575/2013. \nBased on the notifications received from significant institutions: \n• \nif the institution declares that there is implicit support, the JST ch... [] {'question_type': 'simple', 'seed_document_id': 169, 'topic': 'Others'}
a9955bdc-165d-42ed-a259-53bef0d5e0ea What are the purposes of macroprudential extensions in stress tests? Macroprudential extensions in stress tests focus on system-wide effects rather than on individual banks and are run in a top-down manner. They capture important feedback effects or network effects, which can occur through adverse changes in the state of the environment triggered by a stress scenario with a negative impact on lending or through lending or funding links between institutions. Document 125: These tasks are undertaken, where \nappropriate, in collaboration with other divisions of the ECB, the EBA and/or NCAs. \nMicroprudential stress tests are often complemented by macroprudential extensions \nthat focus on system-wide effects rather than on individual banks and which are run \nin a top-down manner, meaning that they do not involve the supervised entities. In \nparti... [] {'question_type': 'simple', 'seed_document_id': 125, 'topic': 'European Banking Supervision'}
a7c255f1-9fd8-48d8-8a6a-5afa995dae21 What happens if a quorum of 50% is not met during an emergency Supervisory Board meeting? If a quorum of 50% in the Supervisory Board for emergency situations is not met, the meeting will be closed and an extraordinary meeting will be held soon afterwards. Document 38: Supervisory Manual – Functioning of the Single Supervisory Mechanism \n \n24 \n• \nif an NCA which is concerned by the decision has different views regarding the \nobjection, the NCA may request mediation; \n• \nif no request for mediation is submitted, the Supervisory Board may amend the \ndraft decision in order to incorporate the comments of the Governing Council; \n• \nif the ... [] {'question_type': 'simple', 'seed_document_id': 38, 'topic': 'Single Supervisory Mechanism'}
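
The saved test set can also be inspected without giskard. The sketch below tallies questions per type directly from the file; it assumes each line of the .jsonl file is one JSON record whose metadata object carries question_type, as the columns above suggest, and `count_question_types` is a hypothetical helper.

```python
import json
from collections import Counter


def count_question_types(path):
    """Count test questions per question type in a JSON Lines file.

    Assumption: one JSON object per line, with the question type stored
    under record["metadata"]["question_type"].
    """
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            metadata = record.get("metadata") or {}
            counts[metadata.get("question_type", "unknown")] += 1
    return counts


# Example, after the test set has been saved:
# print(count_question_types("banking_supervision_testset.jsonl"))
```

A heavily skewed tally is a hint to regenerate the test set or to read the per-type results with care.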

Evaluate and Diagnose the Agent

Now, we focus on evaluating the agent’s performance on the Banking Supervision report. We will use the RAG evaluation toolkit (RAGET) to evaluate the agent’s performance and diagnose the issues.

[13]:
from giskard.rag import evaluate, RAGReport, AgentAnswer
from giskard.rag.metrics.ragas_metrics import ragas_context_recall, ragas_context_precision
[ ]:
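def answer_fn(question, history=None):
    """Answer one test question; `history` is a list of {"role", "content"} dicts."""
    if history:
        # Replay the earlier turns so conversational questions keep their context.
        answer = chat_engine.chat(
            question,
            chat_history=[
                ChatMessage(
                    role=MessageRole.USER if msg["role"] == "user" else MessageRole.ASSISTANT,
                    content=msg["content"]
                ) for msg in history
            ]
        )
    else:
        # Start from an empty history for standalone questions.
        answer = chat_engine.chat(question, chat_history=[])

    # Return the retrieved sources along with the message so they can be scored too.
    return AgentAnswer(
        message=answer.response,
        documents=[source.content for source in answer.sources]
    )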

rag_report = evaluate(
    answer_fn,
    testset=testset,
    knowledge_base=knowledge_base,
    metrics=[ragas_context_recall, ragas_context_precision]
)
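
The two RAGAS metrics passed to evaluate score the retrieved documents rather than the final answer. As rough intuition only, the sketch below swaps the LLM judgments that RAGAS relies on for plain substring checks and hand-set relevance flags. The function names and toy data are hypothetical, and this is not the RAGAS implementation.

```python
def toy_context_recall(reference_statements, retrieved_chunks):
    """Share of reference statements found verbatim in at least one chunk."""
    if not reference_statements:
        return 0.0
    found = 0
    for statement in reference_statements:
        if any(statement.lower() in chunk.lower() for chunk in retrieved_chunks):
            found += 1
    return found / len(reference_statements)


def toy_context_precision(relevance_flags):
    """Mean of precision@k over the ranks k that hold a relevant chunk.

    `relevance_flags` lists, in retrieval order, whether each chunk is relevant.
    Relevant chunks ranked early raise the score; early irrelevant ones lower it.
    """
    precisions = []
    relevant_so_far = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            relevant_so_far += 1
            precisions.append(relevant_so_far / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


# Toy data: one of the two reference statements is covered by the retrieved chunk.
statements = ["supervision is conducted off-site", "the hearing is postponed"]
chunks = ["The day-to-day supervision is conducted off-site by the JSTs."]
recall = toy_context_recall(statements, chunks)
precision = toy_context_precision([False, True])
```

Low recall suggests that the retriever misses passages needed for the reference answer, while low precision suggests that irrelevant passages are ranked ahead of the useful ones.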

[15]:
# Save the RAG report
rag_report.save("banking_supervision_report")

# Load the RAG report
rag_report = RAGReport.load("banking_supervision_report")
[16]:
# RAG report
display(rag_report.to_html(embed=True))

RAGET question types

Each question type assesses a few RAG components. This makes it possible to localize weaknesses in the RAG Agent and give feedback to the developers.

| Question type | Description | Example | Targeted RAG components |
|---|---|---|---|
| Simple | Simple questions generated from an excerpt of the knowledge base | What is the purpose of the holistic approach in the SREP? | Generator, Retriever |
| Complex | Questions made more complex by paraphrasing | In what capacity and with what frequency do NCAs contribute to the formulation and scheduling of supervisory activities, especially concerning the organization of on-site missions? | Generator |
| Distracting | Questions designed to confuse the retrieval part of the RAG with a distracting element that comes from the knowledge base but is irrelevant to the question | Under what conditions does the ECB levy fees to cover the costs of its supervisory tasks, particularly in the context of financial conglomerates requiring cross-sector supervision? | Generator, Retriever, Rewriter |
| Situational | Questions that include user context, to evaluate whether the generator produces an answer relevant to that context | As a bank manager looking to understand the appeal process for a regulatory decision made by the ECB, could you explain what role the ABoR plays in the supervisory decision review process? | Generator |
| Double | Questions with two distinct parts, to evaluate the capabilities of the query rewriter of the RAG | What role does the SSM Secretariat Division play in the decision-making process of the ECB’s supervisory tasks, and which directorates general are involved in the preparation of draft decisions for supervised entities in the ECB Banking Supervision? | Generator, Rewriter |
| Conversational | Questions asked as part of a conversation: the first message describes the context of the question asked in the last message; also tests the rewriter | First message: "I am interested in the sources used for the assessment of risks and vulnerabilities in ECB Banking Supervision." Last message: "What are these sources?" | Rewriter, Routing |
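
To illustrate how the question types help localize weaknesses, the sketch below averages per-type correctness over the components each type targets, using the mapping from the table above. The unweighted average, the helper name and the example scores are assumptions made for illustration; the aggregation giskard applies in its report may differ.

```python
from collections import defaultdict

# Targeted components per question type, copied from the table above.
TARGETED_COMPONENTS = {
    "Simple": ["Generator", "Retriever"],
    "Complex": ["Generator"],
    "Distracting": ["Generator", "Retriever", "Rewriter"],
    "Situational": ["Generator"],
    "Double": ["Generator", "Rewriter"],
    "Conversational": ["Rewriter", "Routing"],
}


def component_scores(correctness_by_type):
    """Average the correctness of every question type that targets a component."""
    collected = defaultdict(list)
    for question_type, correctness in correctness_by_type.items():
        for component in TARGETED_COMPONENTS[question_type]:
            collected[component].append(correctness)
    return {name: sum(values) / len(values) for name, values in collected.items()}


# Hypothetical correctness per question type (not results from this notebook).
example = {"Simple": 1.0, "Complex": 0.5, "Conversational": 0.25}
scores = component_scores(example)
# The Generator is covered by Simple and Complex, the Retriever only by Simple,
# and the Rewriter and Routing only by Conversational.
```

A component whose score is well below the others is the first place to look when a report shows many failed questions.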