🎁 Wrap your ML model#

To scan, test and debug your model, you need to wrap it in a Giskard Model. Your model can use any ML library (HuggingFace, PyTorch, TensorFlow, scikit-learn, etc.) and can be any Python function that respects the expected signature. You can wrap your model in two different ways:

  • Wrap a prediction function that contains all your data pre-processing steps.

  • Wrap a model object in addition to a data pre-processing function.

Hint

Choose β€œWrap a model object” if your model is not serializable by cloudpickle (e.g. TensorFlow models).

A prediction function is any Python function that takes a raw pandas.DataFrame as input and returns the probabilities for each classification label.

Make sure that:

  1. prediction_function encapsulates all the data pre-processing steps (categorical encoding, numerical scaling, etc.).

  2. prediction_function(df[feature_names]) does not raise an error.

from giskard import demo, Model

demo_data_processing_function, demo_sklearn_model = demo.titanic_pipeline()

def prediction_function(df):
    # The pre-processor can be a pipeline of one-hot encoding, imputer, scaler, etc.
    preprocessed_df = demo_data_processing_function(df)
    return demo_sklearn_model.predict_proba(preprocessed_df)

wrapped_model = Model(
    model=prediction_function,
    model_type="classification",
    classification_labels=demo_sklearn_model.classes_,  # Their order MUST be identical to the prediction_function's output order
    feature_names=['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'],  # Default: all columns of your dataset
    # name="titanic_model", # Optional
    # classification_threshold=0.5, # Default: 0.5
)
  • Mandatory parameters

    • model: A prediction function that takes a pandas.DataFrame as input and returns an (\(n\times m\)) array of probabilities corresponding to \(n\) data entries (rows of the pandas.DataFrame) and \(m\) classification_labels. In the case of binary classification, an (\(n\times 1\)) array of probabilities is also accepted.

    • model_type: The type of model, either regression, classification or text_generation.

    • classification_labels: The list of unique categories contained in your dataset's target variable. If classification_labels is a list of \(m\) elements, make sure that:

      • prediction_function returns a (\(n\times m\)) array of probabilities.

      • classification_labels are in the same order as the columns of prediction_function's output.

  • Optional parameters

    • name: Name of the wrapped model.

    • feature_names: An optional list of the feature names. By default, feature_names are all the columns in your dataset. Make sure these features are in the same order as they are in your training dataset.

    • classification_threshold: Model threshold for binary classification problems.
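
To confirm that the wrapped function satisfies these constraints, you can run it on a few rows of your raw data and compare the output shape with the label list. This is a minimal sanity check, not part of the Giskard API; raw_df is a placeholder for your own raw Titanic DataFrame.

import numpy as np

# `raw_df` is a hypothetical stand-in for your own raw dataset.
feature_names = ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
sample = raw_df[feature_names].head()

probs = prediction_function(sample)
assert probs.shape == (len(sample), len(demo_sklearn_model.classes_))  # (n, m) probabilities
assert np.allclose(probs.sum(axis=1), 1.0)  # each row is a probability distribution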

A prediction function is any Python function that takes a raw pandas.DataFrame as input and returns the predictions for your regression task.

Make sure that:

  1. prediction_function encapsulates all the data pre-processing steps (categorical encoding, numerical scaling, etc.).

  2. prediction_function(df[feature_names]) does not raise an error.

import numpy as np
from giskard import demo, Model

demo_data_processing_function, reg = demo.linear_pipeline()

def prediction_function(df):
    preprocessed_df = demo_data_processing_function(df)
    return np.squeeze(reg.predict(preprocessed_df))

wrapped_model = Model(
    model=prediction_function,
    model_type="regression",
    feature_names=['x'],  # Default: all columns of your dataset
    # name="linear_model", # Optional
)
  • Mandatory parameters

    • model: A prediction function that takes a pandas.DataFrame as input and returns an array of \(n\) predictions corresponding to \(n\) data entries (rows of the pandas.DataFrame).

    • model_type: The type of model, either regression, classification or text_generation.

  • Optional parameters

    • name: Name of the wrapped model.

    • feature_names: An optional list of the feature names. By default, feature_names are all the columns in your dataset. Make sure these features are in the same order as they are in your training dataset.
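
As with classification, a quick sanity check is to run the function on a few rows and make sure it returns a flat array with one prediction per row (this is why the example above uses np.squeeze). raw_df is again a placeholder for your own data.

# `raw_df` is a hypothetical stand-in for your own raw dataset with an 'x' column.
sample = raw_df[['x']].head()

preds = prediction_function(sample)
assert preds.ndim == 1 and len(preds) == len(sample)  # one prediction per row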

A prediction function is any Python function that takes a raw pandas.DataFrame as input and returns the predictions for your text generation task.

Make sure that:

  1. prediction_function encapsulates all the data pre-processing steps (categorical encoding, numerical scaling, etc.).

  2. prediction_function(df[feature_names]) does not raise an error.

from langchain.chains import LLMChain
from langchain.llms.fake import FakeListLLM
from langchain.prompts import PromptTemplate
from giskard import Model

responses = [
    "\n\nHueFoots.", "\n\nEcoDrive Motors.", 
    "\n\nRainbow Socks.", "\n\nNoOil Motors."]

llm = FakeListLLM(responses=responses)
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)
chain = LLMChain(llm=llm, prompt=prompt)

def prediction_function(df):
    return [chain.predict(**data) for data in df.to_dict('records')]

wrapped_model = Model(prediction_function, model_type='text_generation')
  • Mandatory parameters

    • model: A prediction function that takes a pandas.DataFrame as input and returns an array of \(n\) predictions corresponding to \(n\) data entries (rows of the pandas.DataFrame).

    • model_type: The type of model, either regression, classification or text_generation.

  • Optional parameters

    • name: Name of the wrapped model.

    • feature_names: An optional list of the feature names. By default, feature_names are all the columns in your dataset. Make sure these features are in the same order as they are in your training dataset.
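
A quick way to check this wrapper is to call the prediction function on a small DataFrame whose columns match the prompt's input variables (here only product); it should return one generated string per row. This is an illustrative check, not part of the Giskard API.

import pandas as pd

sample = pd.DataFrame({"product": ["colorful socks", "electric cars"]})

outputs = prediction_function(sample)
assert len(outputs) == len(sample)  # one generated string per row
print(outputs)  # e.g. ['\n\nHueFoots.', '\n\nEcoDrive Motors.'] with the FakeListLLM above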

Providing the model object to Model lets Giskard automatically infer the ML library of your model and choose a suitable serialization method (implemented by the save_model and load_model methods).

This requires:

  • Mandatory: Overriding the model_predict method, which should take a raw pandas.DataFrame as input and return the probabilities for each classification label (classification) or the predictions (regression or text_generation).

  • Optional: Our pre-defined serialization and prediction methods cover the sklearn, catboost, pytorch, tensorflow, huggingface and langchain libraries. If none of these libraries are detected, cloudpickle is used as the default for serialization. If this fails, we will ask you to also override the save_model and load_model methods and provide your own serialization of the model object (see the sketch at the end of this section).

from giskard import demo, Model

demo_data_processing_function, demo_sklearn_model = demo.titanic_pipeline()

class MyCustomModel(Model):
    def model_predict(self, df):
        preprocessed_df = demo_data_processing_function(df)
        return self.model.predict_proba(preprocessed_df)

wrapped_model = MyCustomModel(
    model=demo_sklearn_model,
    model_type="classification",
    classification_labels=demo_sklearn_model.classes_,  # Their order MUST be identical to the model_predict output order
    feature_names=['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'],  # Default: all columns of your dataset
    # name="titanic_model", # Optional
    # classification_threshold=0.5, # Default: 0.5
    # model_postprocessing_function=None, # Optional
    # **kwargs # Additional model-specific arguments
)
  • Mandatory parameters

    • model: Could be any model from sklearn, catboost, pytorch, tensorflow, huggingface or langchain (check the tutorials). If none of these libraries apply to you, we try to serialize your model with cloudpickle. If that also does not work, we ask you to provide us with your own serialization method.

    • model_type: The type of the model, either regression, classification or text_generation.

    • classification_labels: The list of unique categories contained in your dataset's target variable. If classification_labels is a list of \(m\) elements, make sure that:

      • model_predict returns a (\(n\times m\)) array of probabilities.

      • classification_labels are in the same order as the columns of model_predict's output.

  • Optional parameters

    • name: Name of the wrapped model.

    • feature_names: An optional list of the feature names. By default, feature_names are all the columns in your dataset. Make sure these features are in the same order as they are in your training dataset.

    • classification_threshold: Model threshold for binary classification problems.

    • data_preprocessing_function: A function that takes a pandas.DataFrame as raw input, applies pre-processing and returns any object that could be directly fed to model.

    • model_postprocessing_function: A function that takes a model output as input, applies post-processing and returns an object of the same type and shape as the model output.

    • **kwargs: Additional model-specific arguments (See Models).
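
Since data_preprocessing_function is a regular constructor parameter, subclassing is not strictly required for the common case above: you can pass the pre-processing step directly to Model and let the pre-defined sklearn prediction method handle the rest. A minimal sketch based on the same Titanic demo:

from giskard import demo, Model

demo_data_processing_function, demo_sklearn_model = demo.titanic_pipeline()

wrapped_model = Model(
    model=demo_sklearn_model,
    model_type="classification",
    data_preprocessing_function=demo_data_processing_function,  # applied to the raw DataFrame before the model is called
    classification_labels=demo_sklearn_model.classes_,
    feature_names=['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'],
)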

import numpy as np
from giskard import demo, Model

demo_data_processing_function, reg = demo.linear_pipeline()

class MyCustomModel(Model):
    def model_predict(self, df):
        preprocessed_df = demo_data_processing_function(df)
        return np.squeeze(self.model.predict(preprocessed_df))

wrapped_model = MyCustomModel(
    model=reg,
    model_type="regression",
    feature_names=['x'],  # Default: all columns of your dataset
    # name="my_regression_model", # Optional
    # model_postprocessing_function=None, # Optional
    # **kwargs # Additional model-specific arguments
)
  • Mandatory parameters

    • model: Could be any model from sklearn, catboost, pytorch, tensorflow, huggingface or langchain (check the tutorials). If none of these libraries apply to you, we try to serialize your model with cloudpickle. If that also does not work, we ask you to provide us with your own serialization method.

    • model_type: The type of the model, either regression, classification or text_generation.

  • Optional parameters

    • name: Name of the wrapped model.

    • feature_names: An optional list of the feature names. By default, feature_names are all the columns in your dataset. Make sure these features are in the same order as your training dataset.

    • data_preprocessing_function: A function that takes a pandas.DataFrame as raw input, applies pre-processing and returns any object that could be directly fed to model.

    • model_postprocessing_function: A function that takes a model output as input, applies post-processing and returns an object of the same type and shape as the model output.

    • **kwargs: Additional model-specific arguments (See Models).

from langchain.chains import LLMChain
from langchain.llms.fake import FakeListLLM
from langchain.prompts import PromptTemplate
from giskard import Model

responses = [
    "\n\nHueFoots.", "\n\nEcoDrive Motors.", 
    "\n\nRainbow Socks.", "\n\nNoOil Motors."]

llm = FakeListLLM(responses=responses)
prompt = PromptTemplate(
    input_variables=["product"],
    template="What is a good name for a company that makes {product}?",
)
chain = LLMChain(llm=llm, prompt=prompt)

class MyCustomModel(Model):
    def model_predict(self, df):
        return [self.model.predict(**data) for data in df.to_dict('records')]

wrapped_model = MyCustomModel(chain, model_type='text_generation')
  • Mandatory parameters

    • model: Could be any model from sklearn, catboost, pytorch, tensorflow, huggingface or langchain (check the tutorials). If none of these libraries apply to you, we try to serialize your model with cloudpickle. If that also does not work, we ask you to provide us with your own serialization method.

    • model_type: The type of the model, either regression, classification or text_generation.

  • Optional parameters

    • name: Name of the wrapped model.

    • feature_names: An optional list of the feature names. By default, feature_names are all the columns in your dataset. Make sure these features are in the same order as your training dataset.

    • data_preprocessing_function: A function that takes a pandas.DataFrame as raw input, applies pre-processing and returns any object that could be directly fed to model.

    • model_postprocessing_function: A function that takes a model output as input, applies post-processing and returns an object of the same type and shape as the model output.

    • **kwargs: Additional model-specific arguments (See Models).
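
If cloudpickle serialization fails for your model object, you can override save_model and load_model as mentioned above. The sketch below is only illustrative: it assumes that save_model receives a local directory path and that load_model is a class method receiving the same path; check the Giskard API reference for the exact signatures of these methods in your version.

import joblib
from pathlib import Path
from giskard import Model

class MySerializableModel(Model):
    def model_predict(self, df):
        return self.model.predict_proba(df)

    # Assumption: `path` is a local directory where the serialized model should be stored.
    def save_model(self, path, *args, **kwargs):
        joblib.dump(self.model, Path(path) / "model.joblib")

    # Assumption: class method that rebuilds the raw model object from the same directory.
    @classmethod
    def load_model(cls, path, *args, **kwargs):
        return joblib.load(Path(path) / "model.joblib")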

Model-specific tutorials#

To see examples of model wrapping, have a look at our tutorial section, where we present notebooks covering:

  • ML libraries: HuggingFace, LangChain, REST API, PyTorch, Scikit-learn, LightGBM, TensorFlow

  • ML tasks: Classification, Regression and Text generation

  • Data types: Tabular and Text

Upload your model to the Giskard server#

Uploading the model to the Giskard server enables you to:

  • Compare your model with others using a test suite.

  • Gather feedback from your colleagues regarding your model.

  • Debug your model effectively in case of test failures.

  • Develop new tests that incorporate additional domain knowledge.

To upload your model to the Giskard server, go to Upload an object to the Giskard server.