Metamorphic tests

Tests if the model prediction is invariant to perturbations.

For classification: tests whether the predicted classification label remains the same after feature values perturbation. For regression: checks whether the predicted output remains the same, within the output_sensitivity level, after feature values perturbation.

The test is passed when the ratio of invariant rows is higher than the threshold.

Example: The test is passed when, by switching gender from male to female, more than 50% (threshold 0.5) of males have unchanged outputs.
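The pass criterion for classification can be sketched in plain Python. This is only an illustration of the ratio computation described above, not the library's implementation; the toy classifier, the row layout, and the helper names invariance_ratio, predict and switch_gender are all hypothetical.

```python
def invariance_ratio(predict, rows, perturb):
    """Share of rows whose predicted label is unchanged by `perturb`."""
    unchanged = sum(1 for row in rows if predict(row) == predict(perturb(row)))
    return unchanged / len(rows)


# Hypothetical toy classifier that (undesirably) gives males a small bonus.
def predict(row):
    bonus = 5 if row["gender"] == "male" else 0
    return "approve" if row["income"] + bonus >= 30 else "reject"


# Perturbation from the example: switch gender from male to female.
def switch_gender(row):
    return {**row, "gender": "female"}


males = [{"gender": "male", "income": income} for income in (10, 27, 40, 55)]
ratio = invariance_ratio(predict, males, switch_gender)
passed = ratio > 0.5  # threshold = 0.5
```

Only the row with income 27 changes label here, so three of the four rows are invariant and the test passes at a threshold of 0.5.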

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset.
threshold (float) – The threshold value for the ratio of invariant rows.
output_sensitivity (float) – For regression models only. Threshold on the ratio of the difference between the perturbed prediction and the actual prediction to the actual prediction. A row counts as having a prediction difference if this ratio is above output_sensitivity; for example, with output_sensitivity = 0.1, a relative change above 10% counts as a difference.
debug (bool) – If True and the test fails, a dataset will be provided containing the non-invariant rows.
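For regression, the role of output_sensitivity can be sketched as follows. This is an assumption-laden illustration: taking absolute values and requiring a non-zero original prediction are choices made here, and the function name is_invariant is hypothetical.

```python
def is_invariant(original, perturbed, output_sensitivity=0.1):
    """True when the relative change does not exceed output_sensitivity.

    Assumptions: absolute values are used, and `original` is non-zero.
    """
    relative_change = abs(perturbed - original) / abs(original)
    return relative_change <= output_sensitivity


original_predictions = [100.0, 200.0, 50.0]
perturbed_predictions = [108.0, 230.0, 51.0]

flags = [
    is_invariant(o, p)
    for o, p in zip(original_predictions, perturbed_predictions)
]
ratio = sum(flags) / len(flags)  # share of invariant rows
passed = ratio > 0.5             # threshold = 0.5
```

The second row changes by 15%, which is above the 10% sensitivity, so it counts as a prediction difference; the other two rows are invariant.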

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model prediction increases when the features are perturbed.

For classification models, it tests if the model probability of a given classification_label is increasing after feature values perturbation.

For regression models, it tests if the model prediction is increasing after feature values perturbation.

The test is passed when the percentage of rows that are increasing is higher than the threshold.

Example: For a credit scoring model, the test is passed when, after a decrease of wage by 10%, the default probability increases for more than 50% of people in the dataset.
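The credit scoring example can be sketched with a toy model. Everything here is hypothetical (the default_probability formula, the wage values, and the helper names); it only illustrates how the ratio of increasing rows is compared with the threshold.

```python
def increasing_ratio(predict, rows, perturb):
    """Share of rows whose prediction strictly increases after `perturb`."""
    increased = sum(1 for row in rows if predict(perturb(row)) > predict(row))
    return increased / len(rows)


# Hypothetical model: lower wage means higher default risk, clipped to [0, 1].
def default_probability(row):
    return min(1.0, max(0.0, 1.0 - row["wage"] / 5000))


# Perturbation from the example: decrease wage by 10%.
def decrease_wage(row):
    return {**row, "wage": row["wage"] * 0.9}


people = [{"wage": wage} for wage in (1000, 2500, 4000, 6000)]
ratio = increasing_ratio(default_probability, people, decrease_wage)
passed = ratio > 0.5  # threshold = 0.5
```

The highest earner stays at a clipped probability of 0 before and after the perturbation, so that row does not count as increasing; the other three do.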

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset (Default value = None)
threshold (float) – The threshold value for the ratio of increasing rows. Default is 0.5.
classification_label (str) – One specific label value from the target column (only for classification models).
debug (bool) – If True and the test fails, a dataset will be provided containing the non-increasing rows.

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model prediction decreases when the features are perturbed.

For classification models, it tests if the model probability of a given classification_label is decreasing after feature values perturbation.

For regression models, it tests if the model prediction is decreasing after feature values perturbation.

The test is passed when the percentage of rows that are decreasing is higher than the threshold.

Example: For a credit scoring model, the test is passed when, after an increase of wage by 10%, the default probability decreases for more than 50% of people in the dataset.

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset
threshold (float) – Threshold of the ratio of decreasing rows
classification_label (str) – Optional. One specific label value from the target column
debug (bool) – If True and the test fails, a dataset will be provided containing the non-decreasing rows.
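The interaction between threshold and debug can be sketched like this. It is a simplified illustration, not the library's code: the return shape, the toy model, and the names decreasing_test, default_probability and increase_wage are assumptions.

```python
def decreasing_test(predict, rows, perturb, threshold=0.5, debug=False):
    """Return (passed, ratio, debug_rows) for a decreasing test."""
    non_decreasing = [row for row in rows if predict(perturb(row)) >= predict(row)]
    ratio = (len(rows) - len(non_decreasing)) / len(rows)
    passed = ratio > threshold
    # Failing rows are only handed back when debug is on and the test fails.
    debug_rows = non_decreasing if (debug and not passed) else None
    return passed, ratio, debug_rows


# Hypothetical model: lower wage means higher default risk, clipped to [0, 1].
def default_probability(row):
    return min(1.0, max(0.0, 1.0 - row["wage"] / 5000))


# Perturbation: increase wage by 10%.
def increase_wage(row):
    return {**row, "wage": row["wage"] * 1.1}


people = [{"wage": wage} for wage in (1000, 2500, 4000, 6000)]
passed, ratio, debug_rows = decreasing_test(
    default_probability, people, increase_wage, threshold=0.8, debug=True
)
```

With a strict threshold of 0.8 the test fails, and debug_rows holds the single row whose prediction did not decrease.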

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model prediction decreases when the features are perturbed.

Performs a t-test on two related samples. Sample A consists of the original predictions (probability of classification_label for classification models, or predicted value for regression models). Sample B consists of the predictions after perturbation of one or more of the features (by transformation_function).

It performs a t-test to study if mean(B) < mean(A).

The test is passed when the p-value of the t-test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when an increase of wage by 10% causes a statistically significant decrease of the default probability.
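The decision rule can be sketched with the standard library only. Note the substitution: this sketch computes the paired t statistic but approximates the p-value with a normal distribution rather than Student's t, which is only reasonable for large samples (the toy sample below is kept short for readability). If SciPy is available, scipy.stats.ttest_rel(a, b, alternative="greater") gives the Student t p-value for the same hypothesis. The sample values and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def paired_decrease_p_value(a, b):
    """Approximate one-sided p-value for H1: mean(B) < mean(A), paired samples.

    Assumption: a normal approximation replaces Student's t distribution.
    The differences must not all be identical (stdev would be zero).
    """
    diffs = [x - y for x, y in zip(a, b)]  # positive when the prediction decreased
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return 1.0 - NormalDist().cdf(t_stat)


# Sample A: original default probabilities; sample B: after a 10% wage increase.
a = [0.80, 0.50, 0.20, 0.65, 0.35, 0.90]
b = [0.78, 0.45, 0.12, 0.61, 0.30, 0.84]

p_value = paired_decrease_p_value(a, b)
passed = p_value < 0.05  # critical_quantile = 0.05
```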

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset
critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected
classification_label (str) – One specific label value from the target column; only required for classification models. (Default value = None)
debug (bool) – If True and the test fails, a dataset will be provided containing the non-decreasing rows.

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model prediction increases when feature values are perturbed.

It performs a t-test to study if mean(B) > mean(A).

The test is passed when the p-value of the t-test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when a decrease of wage by 10% causes a statistically significant increase of the default probability.

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset (Default value = None)
critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected
classification_label (str) – Only required for classification models.
debug (bool) – If True and the test fails, a dataset will be provided containing the non-increasing rows.

Returns:

actual_slices_size – Length of dataset tested
message – Test result message
metric – The p-value of the t-test between the original predictions (A) and the perturbed predictions (B)
passed – True if the p-value of the t-test between (A) and (B) is below the critical quantile

Tests if the model predictions are statistically invariant when the feature values are perturbed.

It performs a t-test to study if mean(A) is between mean(B) - window_size/2 and mean(B) + window_size/2.

The test is passed when the following tests pass:

the p-value of the t-test between (A) and (B) + window_size/2 is below the critical quantile
the p-value of the t-test between (B) - window_size/2 and (A) is below the critical quantile

In other words, the test is passed only when both p-values are below the critical quantile.

Example: The test is passed when, by switching gender from male to female, the probability distributions remain statistically invariant. In other words, the test is passed if the mean of the perturbed sample is statistically within a window determined by the user.
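The two one-sided tests can be sketched as below. As a simplification, the p-values use a normal approximation to the paired t statistic instead of Student's t distribution, so this illustrates the structure of the equivalence check rather than exact p-values. The sample values and the helper names p_value_less and invariance_test are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def p_value_less(x, y):
    """Approximate one-sided p-value for H1: mean(x) < mean(y), paired samples.

    Assumption: normal approximation instead of Student's t distribution.
    """
    diffs = [yi - xi for xi, yi in zip(x, y)]
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
    return 1.0 - NormalDist().cdf(t_stat)


def invariance_test(a, b, window_size, critical_quantile=0.05):
    """Pass only if both one-sided tests reject their null hypothesis."""
    half = window_size / 2
    # Test 1: mean(A) < mean(B) + window_size/2
    p_upper = p_value_less(a, [v + half for v in b])
    # Test 2: mean(B) - window_size/2 < mean(A)
    p_lower = p_value_less([v - half for v in b], a)
    return p_upper < critical_quantile and p_lower < critical_quantile


# Sample A: original probabilities; sample B: after switching gender.
a = [0.80, 0.50, 0.20, 0.65, 0.35, 0.90]
b = [0.81, 0.49, 0.22, 0.64, 0.35, 0.91]

wide_window_passes = invariance_test(a, b, window_size=0.2)
narrow_window_passes = invariance_test(a, b, window_size=0.004)
```

A wide window is easy to satisfy, while a very narrow one demands more evidence than this small sample provides, so the same data passes the first call and fails the second.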

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset (Default value = None)
window_size (float) – Probability window in which the mean of the perturbed sample can be in order to pass the test
critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected
debug (bool) – If True and the test fails, a dataset will be provided containing the non-invariant rows.

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model prediction decreases when feature values are perturbed.

Performs the Wilcoxon signed-rank test on two related samples. Sample A consists of the original predictions (probability of classification_label for classification models, or predicted value for regression models). Sample B consists of the predictions after perturbation of one or more features by transformation_function.

This test computes the decreasing test to study if mean(B) < mean(A). The test is passed when the p-value of the Wilcoxon signed-rank test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when an increase of wage by 10% causes a statistically significant decrease of the default probability.
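For small samples, the one-sided Wilcoxon signed-rank p-value can be computed exactly by counting sign assignments. The sketch below uses only the standard library and assumes there are no zero differences and no tied absolute differences; the sample values and the function name are hypothetical, and this is not the library's implementation.

```python
def wilcoxon_decrease_p_value(a, b):
    """Exact one-sided p-value for H1: B tends to be smaller than A.

    Assumptions: no zero differences and no tied absolute differences.
    """
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Rank absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    w_plus = sum(rank for rank, i in enumerate(order, start=1) if diffs[i] > 0)
    # counts[s] = number of sign assignments whose positive-rank sum equals s.
    max_sum = n * (n + 1) // 2
    counts = [1] + [0] * max_sum
    for rank in range(1, n + 1):
        for s in range(max_sum, rank - 1, -1):
            counts[s] += counts[s - rank]
    # P(W+ >= observed) when every sign assignment is equally likely.
    return sum(counts[w_plus:]) / 2 ** n


# Sample A: original default probabilities; sample B: perturbed predictions.
a = [0.80, 0.50, 0.20, 0.65, 0.35, 0.90]
b = [0.78, 0.45, 0.12, 0.61, 0.32, 0.84]

p_value = wilcoxon_decrease_p_value(a, b)
passed = p_value < 0.05  # critical_quantile = 0.05
```

All six predictions decrease, so the positive-rank sum is at its maximum and only one of the 64 sign assignments is as extreme, giving a p-value of 1/64.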

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset
critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected
classification_label (str) – Only required for classification models.
debug (bool) – If True and the test fails, a dataset will be provided containing the non-decreasing rows. (Default value = False)

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model prediction increases when feature values are perturbed.

This test computes the increasing test to study if mean(B) > mean(A). The test is passed when the p-value of the Wilcoxon signed-rank test between (A) and (B) is below the critical quantile.

Example: For a credit scoring model, the test is passed when a decrease of wage by 10% causes a statistically significant increase of the default probability.

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset
critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected
classification_label (str) – Only required for classification models.
debug (bool) – If True and the test fails, a dataset will be provided containing the non-increasing rows.

Returns:

A TestResult object containing the test result.

Return type:

TestResult

Tests if the model predictions are statistically invariant when the feature values are perturbed.

This test computes the equivalence test to show that mean(B) - window_size/2 < mean(A) < mean(B) + window_size/2.

The test is passed when the following tests pass:

the p-value of the t-test between (A) and (B) + window_size/2 is below the critical quantile
the p-value of the t-test between (B) - window_size/2 and (A) is below the critical quantile

Parameters:

model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
transformation_function (TransformationFunction) – Function performing the perturbations to be applied on dataset.
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on dataset (Default value = None)
window_size (float) – Probability window in which the mean of the perturbed sample can be in
critical_quantile (float) – Critical quantile above which the null hypothesis cannot be rejected
debug (bool) – If True and the test fails, a dataset will be provided containing the non-invariant rows. (Default value = False)

Returns:

A TestResult object containing the test result.

Return type:

TestResult