Open In Colab View Notebook on GitHub

MLflow Example - LLM

In this tutorial we will use Giskard LLM Scan to automatically detect issues on a Retrieval Augmented Generation (RAG) task. For simplicity, we consider a simple langchain model.

Question Answering over the IPCC Climate Change Report

In order to highlight the MLflow-Giskard integration and how it can help debug LLMs, we will walk through a practical use case of using the Giskard LLM Scan and the MLflow evaluate API on a Retrieval Augmented Generation (RAG) task: Question Answering based on the 2023 Climate Change Synthesis Report by the IPCC.

In the following example, we illustrate the procedure using OpenAI Client that is the default one; however, please note that our platform supports a variety of language models. For details on configuring different models, visit our 🤖 Setting up the LLM Client page

By the end of this notebook, we should be able to achieve the following goals:

  • Evaluate the langchain models powered by gpt-3.5-turbo-instruct (read this for details) and gpt-4.

  • Compare the results using the MLflow user-interface.

Prerequisites 🔧

To begin the setup, make sure to install the following packages:

[ ]:
%pip install mlflow "giskard[llm]" -q
[ ]:
%pip install langchain langchain-openai openai "pypdf<=3.17.0" faiss-cpu tiktoken pyngrok -q

After completing the installation process, you will be able to observe giskard as part of MLflow’s evaluators:

[ ]:
import mlflow
import giskard

mlflow.models.list_evaluators() # ['default', 'giskard']

Finally, let us configure the OpenAI ChatGPT API key:

[ ]:
import os
# See
os.environ['OPENAI_API_KEY'] = "sk-XXX"

Loading the IPCC report into a vector database

Let us start by processing the IPCC climate change report from a PDF into a vector database. For this, we first use the pypdf library to load process the PDF into array of documents, where each document contains the page content and metadata with page number.

Second, we process these documents by the means of the RecursiveCharacterTextSplitter, which takes a large text and splits it based on a specified chunk size. It does this by using a set of characters. The default characters provided to it are ["\n\n", "\n", " ", ""] (see this article for more details).

Finally, we use the Facebook AI Similarity Search (FAISS), a library for efficient similarity search and clustering of dense vectors, to get the vector database. FAISS contains algorithms that search in sets of vectors of any size, up to ones that possibly do not fit in RAM. It also contains supporting code for evaluation and parameter tuning.

[ ]:
from langchain.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the IPCC Climate Change Synthesis Report from a PDF file
loader = PyPDFLoader(

# Split document
text_splitter = RecursiveCharacterTextSplitter(

# Load the splitted fragments in our vector store
docs = loader.load_and_split(text_splitter)
db = FAISS.from_documents(docs, OpenAIEmbeddings())

The RAG’s prompt 🔃

The prompt we use in order to define the RAG is quite a minimalistic one defined by the following:

[ ]:
from langchain.prompts import PromptTemplate

# We use a simple prompt
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.



Your answer:

prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])

Initialization of the LLMs 🦜

We can now create the two langchain models powered by gpt-3.5-turbo-instruct and gpt-4 . In order to facilitate the organization and retrieval of the different models and results, we will create a small dictionary that will map each foundational model’s name into the the langchain model object

[ ]:
chains = {"gpt-3.5-turbo-instruct": None, "gpt-4": None}
models = {"gpt-3.5-turbo-instruct": None, "gpt-4": None}

Using the RetrievalQA from langchain we can now instantiate the models. We set the temperature to zero in order to reduce the randomness of the LLM’s output. We also point the retriever to the vector database thatwe loaded earlier.

[ ]:
from langchain_openai import ChatOpenAI, OpenAI
from langchain.chains import RetrievalQA

for model_name in models.keys():
  llm = ChatOpenAI(model=model_name, temperature=0) if model_name == "gpt-4" else OpenAI(model=model_name, temperature=0)
  chains[model_name] = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)
  models[model_name] = lambda df: [chains[model_name].invoke(row["query"])['result'] for _, row in df.iterrows()]

Testing the implementation ⚡

With the current setup, we have everything we need to test the RAG. For this we send a question to the RAG wrapped in a pandas.DataFrame, which will become handy in a few steps.

[ ]:
import pandas as pd

df_example = pd.DataFrame({
    "query": [
        "According to the IPCC report, what are key risks in the Europe?",
        "Is sea level rise avoidable? When will it stop?"


It’s working! The answer is coherent with what is stated in the report:

Sea level rise is unavoidable for centuries to millennia due to continuing deep ocean warming and ice sheet melt, and sea levels will remain elevated for thousands of years

(2023 Climate Change Synthesis Report, page 77)

Evaluation 🔬

With the RAG now in place, we are ready to move forward with evaluating and comparing the LLMs. First, let us make sure to define the evaluation configuration. These are meta-data needed by Giskard to run the scan:

[ ]:
     {"name": "Climate Change Question Answering",
      "description": "This model answers any question about climate change based on IPCC reports",
      "feature_names": ["query"],},

Before running the evaluation, based on the following MLflow tutorial, we first need to persist the vectore database that we previously loaded and then define a retriever function that communicates with it.

Once this is done, we can log the langchain model into mlflow following this documentation. We can then finally proceed with the evaluation of each LLM separately using the Giskard evaluator.

[ ]:
for model_name in models.keys():
    with mlflow.start_run(run_name=model_name):
                        evaluators="giskard", # <-- where the magic happens

After completing these steps, MLflow will generate a folder named mlruns that contains all the results… And… That’s it! 🎊 Let us now analyse the results.

Results 📊

After completing the previous steps, if you are on google colab, you can run the following cell to visualize the results. We employ ngrok to establish an https tunnel on port 5000, which is used by the MLflow UI. There, you will find the two LLMs logged as separate runs for comparison and analysis.

[ ]:
from pyngrok import ngrok

# Get your authtoken from

# Run the tracking UI in the background
get_ipython().system_raw("mlflow ui --port 5000 &")

# Terminate open tunnels if exist

# Open an HTTPs tunnel on port 5000 for http://localhost:5000
ngrok_tunnel = ngrok.connect(addr="5000", proto="http", bind_tls=True)
print("MLflow Tracking UI:", ngrok_tunnel.public_url)