Metamorphic tests

giskard.testing.test_metamorphic_invariance(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, output_sensitivity: SuiteInput | float | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model prediction is invariant when the feature values are perturbed

Description:
  • For classification: Test if the predicted classification label remains the same after the feature values are perturbed.

  • For regression: Test if the predicted output remains the same, within the output_sensitivity level, after the feature values are perturbed.

The test is passed when the ratio of invariant rows is higher than the threshold.

Example: The test is passed when, after switching gender from male to female, more than 50% (threshold 0.5) of males have unchanged outputs.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold of the ratio of invariant rows

  • output_sensitivity (float) – Optional. For regression models, the threshold on the ratio of the difference between the perturbed prediction and the actual prediction over the actual prediction. A prediction is considered different when this ratio is above output_sensitivity (for example, 0.1).

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-invariant rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The ratio of unchanged rows over the perturbed rows

  • passed – TRUE if metric > threshold
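The pass criterion of test_metamorphic_invariance can be sketched in plain Python. This is only an illustration of the ratio described above, not the giskard implementation: `invariance_ratio`, the sample predictions, and the handling of a zero original prediction are all assumptions made for the example.

```python
from typing import Optional, Sequence


def invariance_ratio(
    original: Sequence,
    perturbed: Sequence,
    output_sensitivity: Optional[float] = None,
) -> float:
    """Return the share of rows whose prediction is unchanged after perturbation.

    With output_sensitivity=None the predictions are compared as labels
    (classification). Otherwise they are treated as numbers (regression) and a
    row counts as unchanged when the relative difference stays within
    output_sensitivity.
    """
    if len(original) != len(perturbed):
        raise ValueError("original and perturbed must have the same length")
    if len(original) == 0:
        raise ValueError("no rows to compare")

    unchanged = 0
    for before, after in zip(original, perturbed):
        if output_sensitivity is None:
            # Classification: the label must be identical.
            unchanged += before == after
        elif before == 0:
            # Assumption: a zero original prediction only matches a zero output.
            unchanged += after == 0
        else:
            # Regression: relative change with respect to the original prediction.
            unchanged += abs(after - before) / abs(before) <= output_sensitivity
    return unchanged / len(original)


# Hypothetical predictions before and after the perturbation.
labels_before = ["approved", "approved", "rejected", "approved"]
labels_after = ["approved", "rejected", "rejected", "approved"]

ratio = invariance_ratio(labels_before, labels_after)  # 3 of 4 rows unchanged
passed = ratio > 0.5  # threshold of 0.5, strict inequality
```

For a regression model the same helper would be called with a numeric `output_sensitivity`, for instance `invariance_ratio(before, after, output_sensitivity=0.1)`.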

giskard.testing.test_metamorphic_increasing(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, classification_label: SuiteInput | str | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model probability increases when the feature values are perturbed

Description:
  • For classification: Test if the model probability of a given classification_label is increasing after feature values perturbation.

  • For regression: Test if the model prediction is increasing after feature values perturbation.

The test is passed when the ratio of rows that are increasing is higher than the threshold.

Example: For a credit scoring model, the test is passed when, after a 10% decrease in wage, the default probability increases for more than 50% of people in the dataset.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold of the ratio of increasing rows

  • classification_label (str) – Optional. One specific label value from the target column

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-increasing rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The ratio of increasing rows over the perturbed rows

  • passed – TRUE if metric > threshold
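The ratio used by test_metamorphic_increasing, and by its decreasing counterpart, can be sketched as follows. The helper `directional_ratio` and the probabilities are hypothetical, and treating ties as "not moved" is an assumption of this sketch rather than documented giskard behaviour.

```python
from typing import Sequence


def directional_ratio(
    original: Sequence[float],
    perturbed: Sequence[float],
    direction: str = "increasing",
) -> float:
    """Return the share of rows whose prediction moved in the given direction."""
    if direction not in ("increasing", "decreasing"):
        raise ValueError("direction must be 'increasing' or 'decreasing'")
    if len(original) != len(perturbed):
        raise ValueError("original and perturbed must have the same length")
    if len(original) == 0:
        raise ValueError("no rows to compare")

    if direction == "increasing":
        moved = sum(after > before for before, after in zip(original, perturbed))
    else:
        moved = sum(after < before for before, after in zip(original, perturbed))
    # Assumption: rows with identical predictions count as neither direction.
    return moved / len(original)


# Hypothetical default probabilities before and after decreasing wage by 10%.
proba_before = [0.10, 0.40, 0.25, 0.80]
proba_after = [0.15, 0.45, 0.20, 0.80]

ratio = directional_ratio(proba_before, proba_after, "increasing")  # 2 of 4 rows
passed = ratio > 0.5  # False: the ratio must be strictly above the threshold
```

With exactly half of the rows increasing, the test does not pass at the default threshold of 0.5, because the comparison is strict.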

giskard.testing.test_metamorphic_decreasing(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, classification_label: SuiteInput | str | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model probability decreases when the feature values are perturbed

Description:
  • For classification: Test if the model probability of a given classification_label is decreasing after feature values perturbation.

  • For regression: Test if the model prediction is decreasing after feature values perturbation.

The test is passed when the ratio of rows that are decreasing is higher than the threshold.

Example: For a credit scoring model, the test is passed when, after a 10% increase in wage, the default probability decreases for more than 50% of people in the dataset.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • threshold (float) – Threshold of the ratio of decreasing rows

  • classification_label (str) – Optional. One specific label value from the target column

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-decreasing rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The ratio of decreasing rows over the perturbed rows

  • passed – TRUE if metric > threshold

giskard.testing.test_metamorphic_decreasing_t_test(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, critical_quantile: SuiteInput | float | None = 0.05, classification_label: SuiteInput | str | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model probability decreases when the feature values are perturbed

Description: Computes a t-test on two related samples. Sample (A) contains the original probability predictions and sample (B) contains the probabilities after perturbation of one or more of the features. The test is one-sided and checks whether mean(B) < mean(A). The test is passed when the p-value of the t-test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when a decrease of wage by 10% causes a statistically significant probability decrease.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected

  • classification_label (str) – Optional. One specific label value from the target column

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-decreasing rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The p-value of the t-test between the original predictions (A) and the perturbed predictions (B)

  • passed – TRUE if the p-value of the t-test between (A) and (B) is below the critical quantile
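The decision rule of test_metamorphic_decreasing_t_test can be illustrated with a small standard-library sketch. It computes the paired t statistic exactly but, as a stated simplification, approximates the p-value with a standard normal distribution instead of the Student t distribution; for small samples a proper t distribution (for example via SciPy's `scipy.stats.ttest_rel`) is preferable. The function names and data are hypothetical and this is not giskard's own code.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the t statistic of the paired differences a[i] - b[i]."""
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    if len(diffs) < 2:
        raise ValueError("at least two paired observations are required")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("paired differences have zero variance")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


def decreasing_p_value(original: Sequence[float], perturbed: Sequence[float]) -> float:
    """Approximate one-sided p-value for H1: mean(perturbed) < mean(original)."""
    t = paired_t_statistic(original, perturbed)
    # Assumption: normal approximation of the t distribution.
    return 1.0 - NormalDist().cdf(t)


# Hypothetical probabilities: sample (A) before and sample (B) after perturbation.
sample_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.70, 0.52]
sample_b = [0.58, 0.50, 0.69, 0.45, 0.60, 0.57, 0.64, 0.49]

p_value = decreasing_p_value(sample_a, sample_b)
passed = p_value < 0.05  # critical_quantile of 0.05
```

The increasing variant follows by swapping the two samples, so that the alternative hypothesis becomes mean(A) < mean(B).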

giskard.testing.test_metamorphic_increasing_t_test(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, critical_quantile: SuiteInput | float | None = 0.05, classification_label: SuiteInput | str | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model probability increases when the feature values are perturbed

Description: Computes a t-test on two related samples. Sample (A) contains the original probability predictions and sample (B) contains the probabilities after perturbation of one or more of the features. The test is one-sided and checks whether mean(A) < mean(B). The test is passed when the p-value of the t-test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when a decrease of wage by 10% causes a statistically significant probability increase.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected

  • classification_label (str) – Optional. One specific label value from the target column

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-increasing rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The p-value of the t-test between the original predictions (A) and the perturbed predictions (B)

  • passed – TRUE if the p-value of the t-test between (A) and (B) is below the critical quantile

giskard.testing.test_metamorphic_invariance_t_test(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, window_size: SuiteInput | float | None = 0.2, critical_quantile: SuiteInput | float | None = 0.05, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model predictions are statistically invariant when the feature values are perturbed.

Description: Computes a t-test on two related samples. Sample (A) contains the original probability predictions and sample (B) contains the probabilities after perturbation of one or more of the features. The test is an equivalence test that checks whether mean(B) - window_size/2 < mean(A) < mean(B) + window_size/2. The test is passed when both of the following tests pass:

  • the p-value of the t-test between (A) and (B)+window_size/2 is below the critical quantile

  • the p-value of the t-test between (B)-window_size/2 and (A) is below the critical quantile

Example: The test is passed when, after switching gender from male to female, the probability distribution remains statistically invariant. In other words, the test is passed if the mean of the perturbed sample is statistically within a window determined by the user.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

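  • window_size (float) – Probability window within which the mean of the perturbed sample may lie

  • critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-invariant rows.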

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The p-value of the t-test between the original predictions (A) and the perturbed predictions (B)

  • passed – TRUE if both the p-value of the t-test between (A) and (B)+window_size/2 and the p-value of the t-test between (B)-window_size/2 and (A) are below critical_quantile
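The two one-sided tests behind test_metamorphic_invariance_t_test can be sketched with the standard library. As in any such sketch, the names (`one_sided_paired_p_value`, `equivalence_test`) and the data are hypothetical, and the p-values use a normal approximation of the t distribution, which is an assumption made to keep the example dependency-free.

```python
import math
from statistics import NormalDist, mean, stdev
from typing import Sequence, Tuple


def one_sided_paired_p_value(smaller: Sequence[float], larger: Sequence[float]) -> float:
    """Approximate p-value for H1: mean(smaller) < mean(larger) on paired samples."""
    diffs = [hi - lo for lo, hi in zip(smaller, larger)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("paired differences have zero variance")
    t = mean(diffs) / (spread / math.sqrt(len(diffs)))
    # Assumption: normal approximation of the t distribution.
    return 1.0 - NormalDist().cdf(t)


def equivalence_test(
    original: Sequence[float],
    perturbed: Sequence[float],
    window_size: float = 0.2,
    critical_quantile: float = 0.05,
) -> Tuple[bool, float, float]:
    """Check mean(B) - window_size/2 < mean(A) < mean(B) + window_size/2."""
    half = window_size / 2
    upper = [b + half for b in perturbed]
    lower = [b - half for b in perturbed]
    p_upper = one_sided_paired_p_value(original, upper)  # H1: mean(A) < mean(B) + half
    p_lower = one_sided_paired_p_value(lower, original)  # H1: mean(B) - half < mean(A)
    passed = p_upper < critical_quantile and p_lower < critical_quantile
    return passed, p_upper, p_lower


# Hypothetical probabilities before (A) and after (B) switching gender.
sample_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.70, 0.52]
sample_b = [0.63, 0.54, 0.70, 0.50, 0.65, 0.60, 0.69, 0.53]

passed, p_upper, p_lower = equivalence_test(sample_a, sample_b)
```

Both one-sided tests must reject their null hypothesis for the samples to be declared equivalent, which is why a large shift in either direction makes the overall test fail.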

giskard.testing.test_metamorphic_decreasing_wilcoxon(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, critical_quantile: SuiteInput | float | None = 0.05, classification_label: SuiteInput | str | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model probability decreases when the feature values are perturbed

Description: Computes a Wilcoxon signed-rank test on two related samples. Sample (A) contains the original probability predictions and sample (B) contains the probabilities after perturbation of one or more of the features. The test is one-sided and checks whether mean(B) < mean(A). The test is passed when the p-value of the Wilcoxon signed-rank test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when a decrease of wage by 10% causes a statistically significant probability decrease.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected

  • classification_label (str) – Optional. One specific label value from the target column

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-decreasing rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The p-value of the Wilcoxon signed-rank test between the original predictions (A) and the perturbed predictions (B)

  • passed – TRUE if the p-value of the Wilcoxon signed-rank test between (A) and (B) is below the critical quantile
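For readers unfamiliar with the Wilcoxon signed-rank test used by test_metamorphic_decreasing_wilcoxon, here is a minimal standard-library sketch of the one-sided version. It relies on the large-sample normal approximation without tie or continuity corrections, which is a simplification; `signed_rank_decreasing_p_value` and the data are hypothetical, and SciPy's `scipy.stats.wilcoxon` is the more complete option in practice.

```python
import math
from statistics import NormalDist
from typing import Sequence


def signed_rank_decreasing_p_value(
    original: Sequence[float], perturbed: Sequence[float]
) -> float:
    """Approximate one-sided p-value for H1: perturbed values tend to be lower."""
    # Zero differences carry no sign information and are dropped.
    diffs = [a - b for a, b in zip(original, perturbed) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1

    # Sum of the ranks attached to positive differences (original > perturbed).
    w_plus = sum(rank for rank, diff in zip(ranks, diffs) if diff > 0)

    # Assumption: normal approximation of the null distribution of W+.
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    return 1.0 - NormalDist().cdf(z)


# Hypothetical probabilities before (A) and after (B) the perturbation.
sample_a = [0.62, 0.55, 0.71, 0.48, 0.66, 0.59, 0.70, 0.52]
sample_b = [0.58, 0.50, 0.69, 0.45, 0.60, 0.57, 0.64, 0.49]

p_value = signed_rank_decreasing_p_value(sample_a, sample_b)
passed = p_value < 0.05  # critical_quantile of 0.05
```

Because the test works on ranks rather than raw differences, it is less sensitive to a few extreme rows than the t-test variant.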

giskard.testing.test_metamorphic_increasing_wilcoxon(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, critical_quantile: SuiteInput | float | None = 0.05, classification_label: SuiteInput | str | None = None, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model probability increases when the feature values are perturbed

Description: Computes a Wilcoxon signed-rank test on two related samples. Sample (A) contains the original probability predictions and sample (B) contains the probabilities after perturbation of one or more of the features. The test is one-sided and checks whether mean(A) < mean(B). The test is passed when the p-value of the Wilcoxon signed-rank test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when a decrease of wage by 10% causes a statistically significant probability increase.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected

  • classification_label (str) – Optional. One specific label value from the target column

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-increasing rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The p-value of the Wilcoxon signed-rank test between the original predictions (A) and the perturbed predictions (B)

  • passed – TRUE if the p-value of the Wilcoxon signed-rank test between (A) and (B) is below the critical quantile

giskard.testing.test_metamorphic_invariance_wilcoxon(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, transformation_function: SuiteInput | TransformationFunction | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, window_size: SuiteInput | float | None = 0.2, critical_quantile: SuiteInput | float | None = 0.05, debug: SuiteInput | bool | None = False) → GiskardTestMethod

Summary: Tests if the model predictions are statistically invariant when the feature values are perturbed.

Description: Computes a Wilcoxon signed-rank test on two related samples. Sample (A) contains the original probability predictions and sample (B) contains the probabilities after perturbation of one or more of the features. The test is an equivalence test that checks whether mean(B) - window_size/2 < mean(A) < mean(B) + window_size/2. The test is passed when both of the following tests pass:

  • the p-value of the Wilcoxon signed-rank test between (A) and (B)+window_size/2 is below the critical quantile

  • the p-value of the Wilcoxon signed-rank test between (B)-window_size/2 and (A) is below the critical quantile

Example: The test is passed when, after switching gender from male to female, the probability distribution remains statistically invariant. In other words, the test is passed if the mean of the perturbed sample is statistically within a window determined by the user.

Parameters:
  • model (BaseModel) – Model used to compute the test

  • dataset (Dataset) – Dataset used to compute the test

  • transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.

  • slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset

  • window_size (float) – Probability window within which the mean of the perturbed sample may lie

  • critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected

  • debug (bool) – If True and the test fails, a dataset will be provided containing the non-invariant rows.

Returns:
  • actual_slices_size – Length of the dataset tested

  • message – Test result message

  • metric – The p-value of the Wilcoxon signed-rank test between the original predictions (A) and the perturbed predictions (B)

  • passed – TRUE if both the p-value of the Wilcoxon signed-rank test between (A) and (B)+window_size/2 and the p-value of the Wilcoxon signed-rank test between (B)-window_size/2 and (A) are below critical_quantile