Dataset

To scan, test, and debug your model, you need to provide a dataset that your model can run predictions on. This can be your training, testing, golden, or production dataset.

The pandas.DataFrame you provide should contain the raw data before pre-processing (categorical encoding, scaling, etc.). The prediction function that you wrap with the Giskard Model should accept this DataFrame as input.

df

A pandas.DataFrame that contains the raw data (before all the pre-processing steps) and the actual ground truth variable (target). df can contain more columns than the features of the model, such as the sample_id, metadata, etc.

Type: pandas.DataFrame

name

A string representing the name of the dataset (default None).

Type: Optional[str]

target

The column name in df corresponding to the actual target variable (ground truth).

Type: Optional[str]

cat_columns

A list of strings representing the names of categorical columns (default None). If not provided, the categorical columns will be automatically inferred.

Type: Optional[List[str]]

column_types

A dictionary of column names and their types (numeric, category or text) for all columns of df. If not provided, the column types will be automatically inferred.

Type: Optional[Dict[str, str]]
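
To make the expected shape of column_types concrete, here is a minimal sketch of such a mapping together with a small consistency check. The helper check_column_types and the sample column names are hypothetical and are not part of Giskard; only the three type names (numeric, category, text) come from the description above.

```python
from typing import Dict, List

# The three type names documented for column_types.
ALLOWED_TYPES = {"numeric", "category", "text"}


def check_column_types(columns: List[str], column_types: Dict[str, str]) -> List[str]:
    """Hypothetical helper: list problems found in a column_types mapping.

    An empty list means every key is a known column and every value is
    one of the documented type names.
    """
    problems = []
    for name, kind in column_types.items():
        if name not in columns:
            problems.append(f"unknown column: {name}")
        if kind not in ALLOWED_TYPES:
            problems.append(f"unsupported type for {name}: {kind}")
    return problems


# Illustrative column names of a raw DataFrame (assumed, not from the docs).
columns = ["age", "country", "review"]
column_types = {"age": "numeric", "country": "category", "review": "text"}

print(check_column_types(columns, column_types))
```

A mapping like column_types above is what would be passed as the column_types argument when the dataset is created.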

Initializes a Dataset object.

Parameters:

df (pd.DataFrame) – The input dataset as a pandas DataFrame.
name (Optional[str]) – The name of the dataset.
target (Optional[str]) – The column name in df corresponding to the actual target variable (ground truth). The target needs to be explicitly set to None if the dataset doesn’t have any target variable.
cat_columns (Optional[List[str]]) – A list of column names that are categorical.
column_types (Optional[Dict[str, str]]) – A dictionary mapping column names to their types.
id (Optional[uuid.UUID]) – A UUID that uniquely identifies this dataset.

Notes

If neither cat_columns nor column_types is provided, the column types are inferred heuristically. See the _infer_column_types method.

_infer_column_types(column_types: Dict[str, str] | None, cat_columns: List[str] | None, validation: bool = True)

This function infers the column types of a given DataFrame based on the number of unique values and column data types. It takes into account the provided column types and categorical columns. The inferred types can be ‘text’, ‘numeric’, or ‘category’. The function also applies a logarithmic rule to determine the category threshold.

Here’s a summary of the function’s logic:

If no column types are provided, initialize an empty dictionary.
Determine the columns in the DataFrame, excluding the target column if it exists.
If categorical columns are specified, prioritize them over the provided column types and mark them as ‘category’.
Check for any unknown columns in the provided column types and remove them from the dictionary.
If there are no missing columns, remove the target column (if present) from the column types dictionary.
Calculate the number of unique values in each missing column.
For each missing column:
- If the number of unique values is less than or equal to the category threshold, categorize it as ‘category’.
- Otherwise, attempt to convert the column to numeric using pd.to_numeric and categorize it as ‘numeric’.
- If the column does not have the expected numeric data type and validation is enabled, issue a warning message.
- If conversion to numeric raises a ValueError, categorize the column as ‘text’.
Return the column types dictionary.
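
The steps above can be sketched without pandas as follows. This is an illustration under stated assumptions, not Giskard's implementation: the function infer_column_types is hypothetical, columns are plain Python lists, float() stands in for pd.to_numeric, the category threshold is passed in as an argument, the target is always excluded, and the validation warning is omitted.

```python
from typing import Dict, List, Optional


def infer_column_types(
    columns: Dict[str, List[object]],
    category_threshold: int,
    target: Optional[str] = None,
    column_types: Optional[Dict[str, str]] = None,
    cat_columns: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Hypothetical, pandas-free sketch of the documented inference steps."""
    types = dict(column_types or {})
    feature_names = [name for name in columns if name != target]

    # Explicit categorical columns take priority over provided types.
    for name in cat_columns or []:
        types[name] = "category"

    # Drop entries that do not correspond to a known feature column.
    types = {name: kind for name, kind in types.items() if name in feature_names}

    # Infer a type for every feature column that still has none.
    for name in feature_names:
        if name in types:
            continue
        values = columns[name]
        if len(set(values)) <= category_threshold:
            types[name] = "category"
            continue
        try:
            for value in values:
                float(value)  # stand-in for pd.to_numeric
            types[name] = "numeric"
        except (TypeError, ValueError):
            types[name] = "text"
    return types


# Illustrative data (assumed): "label" is the target, "zip" is forced categorical.
data = {
    "age": [23, 35, 41, 58, 62, 19],
    "sex": ["F", "M", "F", "M", "F", "M"],
    "comment": ["ok", "late", "fine", "bad", "good", "slow"],
    "zip": ["75001", "75002", "75003", "75004", "75005", "75006"],
    "label": [0, 1, 0, 1, 1, 0],
}
inferred = infer_column_types(
    data, category_threshold=2, target="label", cat_columns=["zip"]
)
print(inferred)
```

In this sketch, zip would otherwise be inferred as numeric because its values parse as numbers, which is the kind of case where passing cat_columns explicitly is useful.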

The category threshold follows a logarithmic rule: category_threshold = round(np.log10(len(self.df))) if len(self.df) >= 100 else 2. For a DataFrame with at least 100 rows, the threshold is the base-10 logarithm of the row count, rounded to the nearest integer; for smaller DataFrames, it is 2. The threshold therefore grows slowly with the size of the DataFrame.
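
As a quick illustration of how slowly this threshold grows, the rule can be evaluated for a few row counts. The sketch below uses math.log10 from the standard library in place of np.log10, and the function name category_threshold is chosen here for illustration only.

```python
import math


def category_threshold(n_rows: int) -> int:
    """Documented rule: round(log10(n_rows)) for 100 or more rows, else 2."""
    return round(math.log10(n_rows)) if n_rows >= 100 else 2


# Print the threshold for a range of dataset sizes.
for n_rows in (50, 100, 1_000, 50_000, 1_000_000):
    print(n_rows, category_threshold(n_rows))
```

Under this rule, a column in a 1,000-row dataset is inferred as categorical only if it has at most 3 distinct values.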

Returns: A dictionary that maps column names to their inferred types, one of ‘text’, ‘numeric’, or ‘category’.
Return type: dict

add_slicing_function(slicing_function: SlicingFunction)

Adds a slicing function to the data processor’s list of steps.

Parameters: slicing_function (SlicingFunction) – A slicing function to add to the data processor.

add_transformation_function(transformation_function: TransformationFunction)

Adds a transformation function to the data processor’s list of steps.

Parameters: transformation_function (TransformationFunction) – A transformation function to add to the data processor.

slice(slicing_function: SlicingFunction | Callable[[...], bool], row_level: bool = True, get_mask: bool = False, cell_level=False, column_name: str | None = None)

Slice the dataset using the specified slicing_function.

Parameters:

slicing_function (Union[SlicingFunction, SlicingFunctionType]) – A slicing function to apply. If slicing_function is a callable, it will be wrapped in a SlicingFunction object with row_level and cell_level as its arguments. The SlicingFunction object will be used to slice the DataFrame. If slicing_function is a SlicingFunction object, it will be used directly to slice the DataFrame.
row_level (bool) – Whether the slicing_function should be applied to the rows (True) or the whole dataframe (False). Defaults to True.
get_mask (bool) – Whether the slicing_function returns a dataset (False) or a mask, i.e. a list of indices (True). Defaults to False.
cell_level (bool) – Whether the slicing_function should be applied to the cells (True) or the whole dataframe (False). Defaults to False.

Returns:

The sliced dataset as a Dataset object.

Return type:

Dataset

Notes

Raises TypeError: If slicing_function is not a callable or a SlicingFunction object.
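
The row-level behaviour and the effect of get_mask can be illustrated with a small pandas-free sketch. It is not Giskard's implementation: slice_rows is a hypothetical helper, rows are plain dictionaries, and the sample data is invented for the example.

```python
from typing import Any, Callable, Dict, List, Union

Row = Dict[str, Any]


def slice_rows(
    rows: List[Row],
    slicing_function: Callable[[Row], bool],
    get_mask: bool = False,
) -> Union[List[Row], List[int]]:
    """Hypothetical row-level slicing: keep rows for which the predicate is True."""
    if not callable(slicing_function):
        raise TypeError("slicing_function must be callable")
    kept = [i for i, row in enumerate(rows) if slicing_function(row)]
    # With get_mask=True, return the indices instead of the rows themselves.
    return kept if get_mask else [rows[i] for i in kept]


def is_senior(row: Row) -> bool:
    """Example predicate applied to one row at a time."""
    return row["age"] >= 60


rows = [
    {"age": 23, "country": "FR"},
    {"age": 41, "country": "DE"},
    {"age": 67, "country": "FR"},
]

seniors = slice_rows(rows, is_senior)
senior_indices = slice_rows(rows, is_senior, get_mask=True)
print(seniors, senior_indices)
```

The same predicate yields either the selected rows or their positions, which mirrors the two return modes described for get_mask.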

transform(transformation_function: TransformationFunction | Callable[[...], Series | DataFrame], row_level: bool = True, cell_level=False, column_name: str | None = None)

Transform the data in the current Dataset by applying a transformation function.

Parameters:

transformation_function (Union[TransformationFunction, TransformationFunctionType]) – A transformation function to apply. If transformation_function is a callable, it will be wrapped in a TransformationFunction object with row_level and cell_level as its arguments. If transformation_function is a TransformationFunction object, it will be used directly to transform the DataFrame.
row_level (bool) – Whether the transformation_function should be applied to the rows (True) or the whole dataframe (False). Defaults to True.
cell_level (bool) – Whether the transformation_function should be applied to the cells (True) or the whole dataframe (False). Defaults to False.

Returns:

A new Dataset object containing the transformed data.

Return type:

Dataset

Notes

Raises TypeError: If transformation_function is not a callable or a TransformationFunction object.
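
A row-level transformation that returns new data and leaves the input untouched can be sketched as follows. This is a pandas-free illustration, not Giskard's implementation: transform_rows is a hypothetical helper, rows are plain dictionaries, and the sample reviews are invented.

```python
import copy
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def transform_rows(
    rows: List[Row], transformation_function: Callable[[Row], Row]
) -> List[Row]:
    """Hypothetical row-level transform: apply the function to a copy of each row."""
    if not callable(transformation_function):
        raise TypeError("transformation_function must be callable")
    # Copy each row first so the caller's data is never modified in place.
    return [transformation_function(copy.deepcopy(row)) for row in rows]


def uppercase_text(row: Row) -> Row:
    """Example transformation applied to one row at a time."""
    row["text"] = row["text"].upper()
    return row


reviews = [
    {"text": "great product", "rating": 5},
    {"text": "too slow", "rating": 2},
]
shouted = transform_rows(reviews, uppercase_text)
print(shouted)
```

Because the transformation works on copies, reviews keeps its original lowercase text, matching the documented behaviour of returning a new object.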

process()

Process the dataset by applying all the transformation and slicing functions in the defined order.

Returns: The processed dataset after applying all the transformation and slicing functions.
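
Because the steps are applied in the order they were added, the same two functions can give different results depending on that order. The sketch below illustrates this with a hypothetical StepQueue class that stands in for the data processor; it is not Giskard's implementation, and rows are plain dictionaries.

```python
from typing import Any, Callable, Dict, List, Tuple

Row = Dict[str, Any]


class StepQueue:
    """Hypothetical stand-in for the data processor: an ordered list of steps."""

    def __init__(self) -> None:
        self.steps: List[Tuple[str, Callable]] = []

    def add_slicing_function(self, fn: Callable[[Row], bool]) -> None:
        self.steps.append(("slice", fn))

    def add_transformation_function(self, fn: Callable[[Row], Row]) -> None:
        self.steps.append(("transform", fn))

    def process(self, rows: List[Row]) -> List[Row]:
        result = [dict(row) for row in rows]  # work on copies of the rows
        for kind, fn in self.steps:
            if kind == "slice":
                result = [row for row in result if fn(row)]
            else:
                result = [fn(row) for row in result]
        return result


def double_price(row: Row) -> Row:
    return {**row, "price": row["price"] * 2}


def is_expensive(row: Row) -> bool:
    return row["price"] > 15


rows = [{"price": 5}, {"price": 10}, {"price": 20}]

# Transform first, then slice.
queue_a = StepQueue()
queue_a.add_transformation_function(double_price)
queue_a.add_slicing_function(is_expensive)
transform_first = queue_a.process(rows)

# Slice first, then transform.
queue_b = StepQueue()
queue_b.add_slicing_function(is_expensive)
queue_b.add_transformation_function(double_price)
slice_first = queue_b.process(rows)

print(transform_first, slice_first)
```

Doubling before slicing keeps two rows, while slicing before doubling keeps only one, so the order in which functions are registered matters.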

upload(client: GiskardClient, project_key: str)

Uploads the dataset to the specified Giskard project.

Parameters:

client (GiskardClient) – A GiskardClient instance for connecting to the Giskard API.
project_key (str) – The key of the project to upload the dataset to.

Returns:

The ID of the uploaded dataset.

Return type:

str

classmethod download(client: GiskardClient | None, project_key, dataset_id, sample: bool = False)

Downloads a dataset from a Giskard project and returns a Dataset object. If the client is None, then the function assumes that it is running in an internal worker and looks for the dataset locally.

Parameters:

client (Optional[GiskardClient]) – The GiskardClient instance to use for downloading the dataset. If None, the function looks for the dataset locally.
project_key (str) – The key of the Giskard project that the dataset belongs to.
dataset_id (str) – The ID of the dataset to download.
sample (bool) – If True, only a sample of 1000 rows is opened. Defaults to False.

Returns:

A Dataset object that represents the downloaded dataset.

Return type:

Dataset