Performance tests#

giskard.testing.test_mae(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug_percent_rows: SuiteInput | float | None = 0.3, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model Mean Absolute Error is lower than a threshold

Example: The test is passed when the MAE is lower than 10

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for MAE

  • debug_percent_rows (float) – Percentage of rows (sorted by their highest absolute error) to debug. By default 30%.

  • debug (bool) – If True and the test fails, a dataset will be provided containing the top debug_percent_rows of the rows with the highest absolute error (difference between prediction and data).

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The MAE metric

  • passed – TRUE if MAE metric <= threshold

Return type:

TestResult
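
A minimal usage sketch, assuming my_model is a giskard.Model wrapping a regression model and my_dataset is a giskard.Dataset containing the ground-truth target (both names are placeholders):

    from giskard import testing

    # Build the test with bound inputs, then execute it to obtain a TestResult.
    result = testing.test_mae(
        model=my_model,          # assumed: giskard.Model (regression)
        dataset=my_dataset,      # assumed: giskard.Dataset with the target column
        threshold=10,            # pass when MAE <= 10
        debug=True,              # on failure, also collect the worst rows
        debug_percent_rows=0.3,  # keep the top 30% of rows by absolute error
    ).execute()

    print(result.passed, result.metric)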

giskard.testing.test_rmse(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug_percent_rows: SuiteInput | float | None = 0.3, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model RMSE is lower than a threshold

Example: The test is passed when the RMSE is lower than 10

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for RMSE

  • debug_percent_rows (float) – Percentage of rows (sorted by their highest absolute error) to debug. By default 30%.

  • debug (bool) – If True and the test fails, a dataset will be provided containing the top debug_percent_rows of the rows with the highest absolute error (difference between prediction and data).

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The RMSE metric

  • passed – TRUE if RMSE metric <= threshold

Return type:

TestResult

giskard.testing.test_recall(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model Recall is higher than a threshold for a given slice

Example: The test is passed when the Recall for females is higher than 0.7

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Actual dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for Recall

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows.

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The Recall metric

  • passed – TRUE if Recall metric >= threshold

Return type:

TestResult
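
A sketch of the slice-level usage, assuming the dataset has a "sex" column; the slicing function and the model/dataset variables are placeholders:

    import pandas as pd
    from giskard import slicing_function, testing

    # Dataset-level slicing function: keep only the rows of the slice of interest.
    @slicing_function(row_level=False)
    def female_slice(df: pd.DataFrame) -> pd.DataFrame:
        return df[df["sex"] == "female"]

    result = testing.test_recall(
        model=my_model,                # assumed: giskard.Model (classification)
        dataset=my_dataset,            # assumed: giskard.Dataset
        slicing_function=female_slice,
        threshold=0.7,
    ).execute()

    print(result.passed, result.metric)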

giskard.testing.test_auc(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model AUC performance is higher than a threshold for a given slice

Example: The test is passed when the AUC for females is higher than 0.7

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Actual dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for AUC

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows.

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The AUC performance metric

  • passed – TRUE if AUC metric >= threshold

Return type:

TestResult
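
The SuiteInput union in the signatures lets a test defer an input until the suite is run; a sketch under the assumption that Suite.run resolves the shared inputs by name (model, dataset and suite variables are placeholders):

    from giskard import Dataset, Suite, SuiteInput, testing

    # Declare a shared, named input that will be provided at run time.
    shared_dataset = SuiteInput("dataset", Dataset)

    suite = (
        Suite()
        .add_test(testing.test_auc(dataset=shared_dataset, threshold=0.7))
        .add_test(testing.test_recall(dataset=shared_dataset, threshold=0.7))
    )

    # The model and the shared dataset are passed by keyword when running.
    suite_results = suite.run(model=my_model, dataset=my_dataset)
    print(suite_results.passed)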

giskard.testing.test_accuracy(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model Accuracy is higher than a threshold for a given slice

Example: The test is passed when the Accuracy for females is higher than 0.7

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Actual dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for Accuracy

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows.

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The Accuracy metric

  • passed – TRUE if Accuracy metric >= threshold

Return type:

TestResult

giskard.testing.test_precision(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model Precision is higher than a threshold for a given slice

Example: The test is passed when the Precision for females is higher than 0.7

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Actual dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for Precision

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows.

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The Precision metric

  • passed – TRUE if Precision metric >= threshold

Return type:

TestResult

giskard.testing.test_f1(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model F1 score is higher than a defined threshold for a given slice

Example: The test is passed when F1 score for females is higher than 0.7

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Actual dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for F1 Score

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows.

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The F1 score metric

  • passed – TRUE if F1 score metric >= threshold

Return type:

TestResult
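
As with the other tests, the bound test can be executed directly or collected into a suite; a sketch with placeholder model/dataset names and fully bound inputs (in contrast to the SuiteInput sketch above):

    from giskard import Suite, testing

    suite = (
        Suite()
        .add_test(testing.test_f1(model=my_model, dataset=my_dataset, threshold=0.7))
        .add_test(testing.test_precision(model=my_model, dataset=my_dataset, threshold=0.7))
    )

    # Every input is already bound, so no run-time arguments are needed.
    suite_results = suite.run()
    print(suite_results.passed)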

giskard.testing.test_r2(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 1.0, debug_percent_rows: SuiteInput | float | None = 0.3, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the model R-Squared is higher than a threshold

Example: The test is passed when the R-Squared is higher than 0.7

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold value for R-Squared

  • debug_percent_rows (float) – Percentage of rows (sorted by their highest absolute error) to debug. By default 30%.

  • debug (bool) – If True and the test fails, a dataset will be provided containing the top debug_percent_rows of the rows with the highest absolute error (difference between prediction and data).

Returns:

  • actual_slices_size – Length of dataset tested

  • metric – The R-Squared metric

  • passed – TRUE if R-Squared metric >= threshold

Return type:

TestResult

giskard.testing.test_diff_recall(model: SuiteInput | BaseModel | None = None, actual_dataset: SuiteInput | Dataset | None = None, reference_dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.1, direction: SuiteInput | Direction | None = Direction.Invariant, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the absolute percentage change of model Recall between two samples is lower than a threshold

Example: The test is passed when the Recall for females differs by less than 10% from the Recall for males. For example, if the Recall for males is 0.8 (actual_dataset) and the Recall for females is 0.6 (reference_dataset), then the absolute percentage change in Recall is 0.2 / 0.8 = 0.25 and the test will fail

Parameters:
  • model (BaseModel) – Model used to compute the test

  • actual_dataset (Dataset) – Actual dataset used to compute the test

  • reference_dataset (Dataset) – Reference dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on both actual and reference datasets

  • threshold (float) – Threshold value for Recall difference

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows from both actual_dataset and reference_dataset

Returns:

  • actual_slices_size – Length of actual_dataset tested

  • reference_slices_size – Length of reference_dataset tested

  • metric – The Recall difference metric

  • passed – TRUE if Recall difference < threshold

Return type:

TestResult

giskard.testing.test_diff_accuracy(model: SuiteInput | BaseModel | None = None, actual_dataset: SuiteInput | Dataset | None = None, reference_dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.1, direction: SuiteInput | Direction | None = Direction.Invariant, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the absolute percentage change of model Accuracy between two samples is lower than a threshold

Example: The test is passed when the Accuracy for females differs by less than 10% from the Accuracy for males. For example, if the Accuracy for males is 0.8 (actual_dataset) and the Accuracy for females is 0.6 (reference_dataset), then the absolute percentage change in Accuracy is 0.2 / 0.8 = 0.25 and the test will fail

Parameters:
  • model (BaseModel) – Model used to compute the test

  • actual_dataset (Dataset) – Actual dataset used to compute the test

  • reference_dataset (Dataset) – Reference dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on both actual and reference datasets

  • threshold (float) – Threshold value for Accuracy Score difference

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows from both actual_dataset and reference_dataset

Returns:

  • actual_slices_size – Length of actual_dataset tested

  • reference_slices_size – Length of reference_dataset tested

  • metric – The Accuracy difference metric

  • passed – TRUE if Accuracy difference < threshold

Return type:

TestResult
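
A sketch of the two-sample usage, assuming dataset_females and dataset_males are giskard.Dataset slices of the same data (all names are placeholders):

    from giskard import testing

    result = testing.test_diff_accuracy(
        model=my_model,                   # assumed: giskard.Model (classification)
        actual_dataset=dataset_females,   # sample under test
        reference_dataset=dataset_males,  # baseline sample
        threshold=0.1,                    # fail when the relative Accuracy change is >= 10%
    ).execute()

    print(result.metric, result.passed)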

giskard.testing.test_diff_precision(model: SuiteInput | BaseModel | None = None, actual_dataset: SuiteInput | Dataset | None = None, reference_dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.1, direction: SuiteInput | Direction | None = Direction.Invariant, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the absolute percentage change of model Precision between two samples is lower than a threshold

Example: The test is passed when the Precision for females differs by less than 10% from the Precision for males. For example, if the Precision for males is 0.8 (actual_dataset) and the Precision for females is 0.6 (reference_dataset), then the absolute percentage change in Precision is 0.2 / 0.8 = 0.25 and the test will fail

Parameters:
  • model (BaseModel) – Model used to compute the test

  • actual_dataset (Dataset) – Actual dataset used to compute the test

  • reference_dataset (Dataset) – Reference dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on both actual and reference datasets

  • threshold (float) – Threshold value for Precision difference

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows from both actual_dataset and reference_dataset

Returns:

  • actual_slices_size – Length of actual_dataset tested

  • reference_slices_size – Length of reference_dataset tested

  • metric – The Precision difference metric

  • passed – TRUE if Precision difference < threshold

Return type:

TestResult

giskard.testing.test_diff_rmse(model: SuiteInput | BaseModel | None = None, actual_dataset: SuiteInput | Dataset | None = None, reference_dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.1, direction: SuiteInput | Direction | None = Direction.Invariant, debug_percent_rows: SuiteInput | float | None = 0.3, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the absolute percentage change of model RMSE between two samples is lower than a threshold

Example: The test is passed when the RMSE for females differs by less than 10% from the RMSE for males. For example, if the RMSE for males is 0.8 (actual_dataset) and the RMSE for females is 0.6 (reference_dataset), then the absolute percentage change in RMSE is 0.2 / 0.8 = 0.25 and the test will fail

Parameters:
  • model (BaseModel) – Model used to compute the test

  • actual_dataset (Dataset) – Actual dataset used to compute the test

  • reference_dataset (Dataset) – Reference dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on both actual and reference datasets

  • threshold (float) – Threshold value for RMSE difference

  • debug_percent_rows (float) – Percentage of rows (sorted by their highest absolute error) to debug. By default 30%.

  • debug (bool) – If True and the test fails, a dataset will be provided containing the top debug_percent_rows of the rows with the highest absolute error (difference between prediction and data) from both actual_dataset and reference_dataset.

Returns:

  • actual_slices_size – Length of actual_dataset tested

  • reference_slices_size – Length of reference_dataset tested

  • metric – The RMSE difference metric

  • passed – TRUE if RMSE difference < threshold

Return type:

TestResult
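
A sketch combining a shared slicing function with the debug options, using placeholder names; the slicing function is applied to both samples before the two RMSE values are compared:

    import pandas as pd
    from giskard import slicing_function, testing

    # Hypothetical slice applied to both the actual and the reference datasets.
    @slicing_function(row_level=False)
    def high_price_slice(df: pd.DataFrame) -> pd.DataFrame:
        return df[df["price"] > 100]

    result = testing.test_diff_rmse(
        model=my_model,                    # assumed: giskard.Model (regression)
        actual_dataset=new_data,           # assumed: current sample
        reference_dataset=training_data,   # assumed: baseline sample
        slicing_function=high_price_slice,
        threshold=0.1,
        debug=True,                        # on failure, collect the worst rows
        debug_percent_rows=0.3,
    ).execute()

    print(result.metric, result.passed)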

giskard.testing.test_diff_f1(model: SuiteInput | BaseModel | None = None, actual_dataset: SuiteInput | Dataset | None = None, reference_dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.1, direction: SuiteInput | Direction | None = Direction.Invariant, debug: SuiteInput | bool | None = False) GiskardTestMethod[source]#

Test if the absolute percentage change in model F1 Score between two samples is lower than a threshold

Example: The test is passed when the F1 Score for females differs by less than 10% from the F1 Score for males. For example, if the F1 Score for males is 0.8 (actual_dataset) and the F1 Score for females is 0.6 (reference_dataset), then the absolute percentage change in F1 Score is 0.2 / 0.8 = 0.25 and the test will fail

Parameters:
  • model (BaseModel) – Model used to compute the test

  • actual_dataset (Dataset) – Actual dataset used to compute the test

  • reference_dataset (Dataset) – Reference dataset used to compute the test

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on both actual and reference datasets

  • threshold (float) – Threshold value for F1 Score difference

  • debug (bool) – If True and the test fails, a dataset will be provided containing all the incorrectly predicted rows from both actual_dataset and reference_dataset

Returns:

  • actual_slices_size – Length of actual_dataset tested

  • reference_slices_size – Length of reference_dataset tested

  • metric – The F1 Score difference metric

  • passed – TRUE if F1 Score difference < threshold

Return type:

TestResult