MLflow Example - LLM - Databricks¶
In this tutorial we will use Giskard LLM Scan to automatically detect issues on a Retrieval Augmented Generation (RAG) task. For simplicity, we consider a simple langchain
model. Here’s a snapshot of the report that the integration can produce, but more on that later.
Question Answering over the IPCC Climate Change Report¶
In order to highlight the MLflow-Giskard integration and how it can help debug LLMs, we will walk through a practical use case of using the Giskard LLM Scan and the MLflow evaluate
API on a Retrieval Augmented Generation (RAG) task: Question Answering based on the 2023 Climate Change Synthesis Report by the IPCC.
In the following example, we illustrate the procedure using OpenAI Client that is the default one; however, please note that our platform supports a variety of language models. For details on configuring different models, visit our 🤖 Setting up the LLM Client page
By the end of this notebook, we should be able to achieve the following goals:
Evaluate the
langchain
models powered bygpt-3.5-turbo-instruct
(read this for details) andgpt-4
.Compare the results using the MLflow user-interface.
Prerequisites 🔧¶
To begin the setup, make sure to install the following packages:
[ ]:
%pip install mlflow "giskard[llm]" -q
[ ]:
%pip install langchain langchain-openai openai "pypdf<=3.17.0" faiss-cpu tiktoken pyngrok -q
After completing the installation process, you will be able to observe giskard as part of MLflow’s evaluators:
[ ]:
import mlflow
import giskard
mlflow.models.list_evaluators() # ['default', 'giskard']
Finally, let us configure the OpenAI ChatGPT API key:
[ ]:
import os
# See https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key
os.environ['OPENAI_API_KEY'] = "sk-XXX"
Loading the IPCC report into a vector database¶
Let us start by processing the IPCC climate change report from a PDF into a vector database. For this, we first use the pypdf
library to load process the PDF into array of documents, where each document contains the page content and metadata with page number.
Second, we process these documents by the means of the RecursiveCharacterTextSplitter
, which takes a large text and splits it based on a specified chunk size. It does this by using a set of characters. The default characters provided to it are ["\n\n", "\n", " ", ""]
(see this article for more details).
Finally, we use the Facebook AI Similarity Search (FAISS), a library for efficient similarity search and clustering of dense vectors, to get the vector database. FAISS contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.
[ ]:
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Load the IPCC Climate Change Synthesis Report from a PDF file
loader = PyPDFLoader(
"https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf"
)
# Split document
text_splitter = RecursiveCharacterTextSplitter(
chunk_size=1000,
chunk_overlap=100,
length_function=len,
add_start_index=True,
)
# Load the splitted fragments in our vector store
docs = loader.load_and_split(text_splitter)
db = FAISS.from_documents(docs, OpenAIEmbeddings())
The RAG’s prompt 🔃¶
The prompt we use in order to define the RAG is quite a minimalistic one defined by the following:
[ ]:
from langchain.prompts import PromptTemplate
# We use a simple prompt
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.
Context:
{context}
Question:
{question}
Your answer:
"""
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
Initialization of the LLMs 🦜¶
We can now create the two langchain
models powered by gpt-3.5-turbo-instruct
and gpt-4
. In order to facilitate the organization and retrieval of the different models and results, we will create a small dictionary that will map each foundational model’s name into the the langchain model object
[ ]:
chains = {"gpt-3.5-turbo-instruct": None, "gpt-4": None}
models = {"gpt-3.5-turbo-instruct": None, "gpt-4": None}
Using the RetrievalQA
from langchain
we can now instantiate the models. We set the temperature
to zero in order to reduce the randomness of the LLM’s output. We also point the retriever to the vector database thatwe loaded earlier.
[ ]:
from langchain_openai import ChatOpenAI, OpenAI
from langchain.chains import RetrievalQA
for model_name in models.keys():
llm = ChatOpenAI(model=model_name, temperature=0) if model_name == "gpt-4" else OpenAI(model=model_name, temperature=0)
chains[model_name] = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
models[model_name] = lambda df: [chains[model_name].invoke(row["query"])['result'] for _, row in df.iterrows()]
Testing the implementation ⚡¶
With the current setup, we have everything we need to test the RAG. For this we send a question to the RAG wrapped in a pandas.DataFrame
, which will become handy in a few steps.
[ ]:
import pandas as pd
df_example = pd.DataFrame({
"query": [
"According to the IPCC report, what are key risks in the Europe?",
"Is sea level rise avoidable? When will it stop?"
]
})
print(models["gpt-3.5-turbo-instruct"](df_example.tail(1)))
It’s working! The answer is coherent with what is stated in the report:
Sea level rise is unavoidable for centuries to millennia due to continuing deep ocean warming and ice sheet melt, and sea levels will remain elevated for thousands of years
(2023 Climate Change Synthesis Report, page 77)
Evaluation 🔬¶
With the RAG now in place, we are ready to move forward with evaluating and comparing the LLMs. First, let us make sure to define the evaluation configuration. These are meta-data needed by Giskard to run the scan:
[ ]:
evaluator_config={
"model_config":
{"name": "Climate Change Question Answering",
"description": "This model answers any question about climate change based on IPCC reports",
"feature_names": ["query"],},
}
Before running the evaluation, based on the following MLflow tutorial, we first need to persist the vectore database that we previously loaded and then define a retriever function that communicates with it.
Once this is done, we can log the langchain
model into mlflow
following this documentation. We can then finally proceed with the evaluation of each LLM separately using the Giskard evaluator.
[ ]:
for model_name in models.keys():
with mlflow.start_run(run_name=model_name):
mlflow.evaluate(model=models[model_name],
model_type="question-answering",
data=df_example,
evaluators="giskard", # <-- where the magic happens
evaluator_config=evaluator_config)
After completing these steps, MLflow will generate a folder named mlruns that contains all the results… And… That’s it! 🎊 Let us now analyse the results.
Results 📊¶
After completing the previous steps, if you are on google colab, you can run the following cell to visualize the results. We employ ngrok
to establish an https
tunnel on port 5000
, which is used by the MLflow UI. There, you will find the two LLMs logged as separate runs for comparison and analysis.
[ ]:
from pyngrok import ngrok
# Get your authtoken from https://dashboard.ngrok.com/auth
NGROK_AUTH_TOKEN = "NGROK_TOKEN"
ngrok.set_auth_token(NGROK_AUTH_TOKEN)
# Run the tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &")
# Terminate open tunnels if exist
ngrok.kill()
# Open an HTTPs tunnel on port 5000 for http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)
To get an idea on the type of results you can visualise, we show in the following screenshot examples of the MLflow Tracking Server on Databricks.
You will find first the two LLMs logged as separate runs for comparison and analysis.
The Giskard plugin will log three primary outcomes per run onto MLflow: a scan HTML report showcasing all discovered hidden vulnerabilities, the metrics produced by the scan, and a standardized scan JSON file facilitating comparisons across various runs.
The Giskard scan was able to identify 10 potential issues with the gpt-3.5-turbo-instruct based LLM. These fall under the hallucination, harmfulness, sensitive information disclosure and prompt injection categories.
After each model evaluation, a scan-summary.json file is created, enabling a comparison of vulnerabilities and metrics for each model in the “Evaluation” view of MLflow.
Finally, in conjunction with the qualitative scan report, Giskard automatically maps most of the vulnerabilities found into quantitative metrics as follows: