Open In Colab View Notebook on GitHub

MLFlow Example - LLM#

Detecting LLM vulnerabilities in MLflow with Giskard#

This example demonstrates how to efficiently scan two LLMs for hidden vulnerabilities using Giskard and interpret the results within MLflow through just a few lines of code. The LLMs used are:



Max Tokens

Training data


Capable of very simple tasks, usually the fastest model in the GPT-3 series, and lowest cost.

2049 tokens

Up to Oct 2019


Most capable GPT-3 model. Can do any task the other models can do, often with higher quality.

2049 tokens

Up to Oct 2019

Based on the following simple prompt:

[ ]:
from langchain import PromptTemplate
prompt = PromptTemplate(template="Create a reader comment according to the following article summary: '{text}'",

We will populate 1000 article summaries from the following dataset, that consists of 4515 examples gathered from Hindu, Indian times and Guardian. Time period ranges from February to august 2017.

[ ]:
import pandas as pd
df = pd.read_csv('')
df_sample = pd.DataFrame(df["text"].sample(1000, random_state=11))

With the prompt and dataset in place, we are ready to move forward with evaluating and comparing the LLMs. First, make sure to set up your OpenAI API key:

[ ]:
import openai
openai.api_key = "YOUR_OPENAI_API_KEY"

The initial step involves loading the two models using the langchain library. Next, we log the models in mlflow, and finally, we proceed with the evaluation of each LLM separately using the Giskard evaluator.

[ ]:
import mlflow
import openai
from langchain import llms, LLMChain

models = ["text-ada-001", "text-davinci-001"]

for model in models:
    llm = llms.OpenAI(openai_api_key=openai.api_key,

    chain = LLMChain(prompt=prompt, llm=llm)

    with mlflow.start_run(run_name=model):
        model_uri = mlflow.langchain.log_model(chain, "langchain").model_uri

After completing the previous steps, you can run mlflow ui from the directory where the mlruns folder is located, which will enable you to visualize the results. By accessing, you will be presented with the interface. There, you will find the two LLMs logged as separate runs for comparison and analysis.


For each LLM, the following artifacts will be logged: 33c49713086246eeb3a34f315469257b

The giskard scan results: a6be951fc2af4d1e921201aaf358eee9

The metrics generated by the scan: 6e074d54ff0f437e8c5c7b46b9abed9c

A scan summary: After each model evaluation, a scan-summary.json file is created, enabling a comparison of vulnerabilities and metrics for each model in the Artifact view. 665e031475e04930b4615f467d4d4b90