Statistical tests
- giskard.testing.test_right_label(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, classification_label: SuiteInput | str | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: Test if the model returns the right classification label for a slice
Description: The test passes when the percentage of rows returning the right classification label in a given slice is higher than the threshold
Example: For a credit scoring model, the test passes when more than 50% of people with high salaries are classified as “non default”
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
classification_label (str) – Classification label you want to test
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on the dataset
threshold (float) – Threshold for the percentage of passed rows
debug (bool) – If True and the test fails, a dataset will be provided containing the rows that do not return the right classification label.
- Returns:
actual_slices_size – Length of the dataset tested
metric – The ratio of rows with the right classification label over the total number of rows in the slice
passed – TRUE if metric > threshold
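The pass criterion described above can be sketched in plain Python. This is an illustration of the ratio and the threshold comparison only, not Giskard's implementation; the helper name `right_label_ratio` and the sample predictions are hypothetical.

```python
def right_label_ratio(predicted_labels, classification_label):
    """Share of rows in the slice whose predicted label equals the expected one."""
    if not predicted_labels:
        raise ValueError("the slice is empty")
    hits = sum(1 for label in predicted_labels if label == classification_label)
    return hits / len(predicted_labels)


# Hypothetical predictions for a "high salary" slice (assumed data).
slice_predictions = ["non default", "non default", "default", "non default"]

ratio = right_label_ratio(slice_predictions, "non default")
passed = ratio > 0.5  # 0.5 is the default threshold shown in the signature
```

With three of the four rows predicted as “non default”, the ratio exceeds the default threshold, so the check passes.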
- giskard.testing.test_output_in_range(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, classification_label: SuiteInput | str | None = None, min_range: SuiteInput | float | None = 0.3, max_range: SuiteInput | float | None = 0.7, threshold: SuiteInput | float | None = 0.5, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: Test if the model output belongs to the right range for a slice
Description: The test passes when the ratio of rows in the right range inside the slice is higher than the threshold.
For classification: tests if the predicted probability for a given classification label belongs to the right range for a dataset slice
For regression: tests if the predicted output belongs to the right range for a dataset slice
Example: For classification: for a credit scoring model, the test passes when more than 50% of people with a high wage have a probability of defaulting between 0 and 0.1
For regression: the predicted sale price of a house in the city falls in a particular range
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on the dataset
classification_label (Optional[str]) – Optional. Classification label you want to test
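min_range (float) – Lower bound of the accepted range (predicted probability of the classification label, or predicted output for regression)
max_range (float) – Upper bound of the accepted range (predicted probability of the classification label, or predicted output for regression)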
threshold (float) – Threshold for the percentage of passed rows
debug (bool) – If True and the test fails, a dataset will be provided containing the rows that are out of the given range.
- Returns:
actual_slices_size – Length of the dataset tested
metric – The proportion of rows in the right range inside the slice
passed – TRUE if metric > threshold
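As a rough sketch of the metric described above, the proportion of in-range outputs can be computed as follows. The helper `in_range_ratio` and the sample probabilities are hypothetical, and treating both bounds as inclusive is an assumption; the documentation does not say how Giskard handles the boundaries.

```python
def in_range_ratio(outputs, min_range=0.3, max_range=0.7):
    """Proportion of outputs that fall inside [min_range, max_range].

    Inclusive bounds are an assumption made for this sketch.
    """
    if not outputs:
        raise ValueError("the slice is empty")
    inside = sum(1 for value in outputs if min_range <= value <= max_range)
    return inside / len(outputs)


# Hypothetical predicted probabilities of "default" for a high-wage slice.
default_probabilities = [0.02, 0.08, 0.05, 0.40, 0.09]

ratio = in_range_ratio(default_probabilities, min_range=0.0, max_range=0.1)
passed = ratio > 0.5  # default threshold from the signature
```

The same function applies to regression outputs: pass the predicted values and the bounds of the expected range.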
- giskard.testing.test_disparate_impact(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, protected_slicing_function: SuiteInput | SlicingFunction | None = None, unprotected_slicing_function: SuiteInput | SlicingFunction | None = None, positive_outcome: SuiteInput | str | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, min_threshold: SuiteInput | float | None = 0.8, max_threshold: SuiteInput | float | None = 1.25, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: Tests if the model is biased more towards an unprotected slice of the dataset over a protected slice. Note that this test reflects only a possible bias in the model while being agnostic to any bias in the dataset it was trained on. The Disparate Impact (DI) is only valid for classification models and is computed as the ratio of the average count of correct predictions for the protected slice to that for the unprotected slice, given a certain positive_outcome.
Description: Calculate the Disparate Impact between a protected and unprotected slice of a dataset. Otherwise known as the “80 percent” rule, the Disparate Impact determines if a model is having an “adverse impact” on a protected (or minority in some cases) group.
Example: The rule was originally based on the rates at which job applicants were hired. For example, if XYZ Company hired 50 percent of the men applying for work in a predominantly male occupation while hiring only 20 percent of the female applicants, one could look at the ratio of those two hiring rates to judge whether there might be a discrimination problem. The ratio of 20:50 means that the rate of hiring for female applicants is only 40 percent of the rate of hiring for male applicants. That is, 20 divided by 50 equals 0.40, which is equivalent to 40 percent. Clearly, 40 percent is well below the 80 percent that was arbitrarily set as an acceptable difference in hiring rates. Therefore, in this example, XYZ Company could have been called upon to prove that there was a legitimate reason for hiring men at a rate so much higher than the rate of hiring women.
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
protected_slicing_function (SlicingFunction) – Slicing function that defines the protected group from the full dataset given
unprotected_slicing_function (SlicingFunction) – Slicing function that defines the unprotected group from the full dataset given
positive_outcome (str) – The target value that is considered a positive outcome in the dataset
slicing_function (Optional[SlicingFunction]) – Slicing function to be applied on the dataset
min_threshold (float) – Threshold below which the DI test is considered to fail, by default 0.8
max_threshold (float) – Threshold above which the DI test is considered to fail, by default 1.25
debug (bool) – If True and the test fails, a dataset will be provided containing the rows from the protected and unprotected slices that were incorrectly predicted on a specific positive outcome.
- Returns:
metric – The disparate impact ratio
passed – TRUE if min_threshold < disparate impact ratio < max_threshold
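The hiring example above can be reproduced with a few lines of Python. This sketch follows the classic “80 percent” rule on selection rates, which may differ from how Giskard derives the rates from model predictions; the helper `disparate_impact` and the applicant lists are hypothetical.

```python
def disparate_impact(protected_outcomes, unprotected_outcomes, positive_outcome):
    """Ratio of the positive-outcome rate in the protected group to the unprotected one."""

    def positive_rate(outcomes):
        return sum(1 for o in outcomes if o == positive_outcome) / len(outcomes)

    return positive_rate(protected_outcomes) / positive_rate(unprotected_outcomes)


# XYZ Company example: 20% of female and 50% of male applicants are hired.
women = ["hired"] * 20 + ["rejected"] * 80
men = ["hired"] * 50 + ["rejected"] * 50

di = disparate_impact(women, men, "hired")
passed = 0.8 < di < 1.25  # default min_threshold and max_threshold
```

Here the ratio is 20 / 50 = 0.40, which is below the default min_threshold of 0.8, so the check fails. The sketch does not guard against an empty group or a zero positive rate in the unprotected group.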
- giskard.testing.test_nominal_association(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, method: SuiteInput | str | None = 'theil_u', threshold: SuiteInput | float | None = 0.5, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: A statistical test for nominal association between the dataset slice and the model predictions. It aims to determine whether there is a significant relationship or dependency between the two. It assesses whether the observed association is likely to occur by chance or if it represents a true association.
Description: The general procedure involves setting up a null hypothesis that assumes no association between the variables and an alternative hypothesis that suggests an association exists. The test statistic is computed with one of three methods: “theil_u”, “cramer_v” or “mutual_information”.
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
slicing_function (SlicingFunction) – Slicing function to be applied on the dataset
method (Optional[str]) – The association test statistic. Choose between “theil_u”, “cramer_v”, and “mutual_information”. (default = “theil_u”)
threshold (float) – Threshold value for the association score computed with the chosen method
debug (bool) – If True and the test fails, a dataset will be provided containing the rows of the dataset slice.
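All three methods work on two categorical variables. One plausible reading of the summary above is that these are slice membership and the predicted label; that pairing is an assumption of this sketch, not something the documentation spells out. The helper `contingency_table` and the sample data are hypothetical.

```python
from collections import Counter


def contingency_table(in_slice, predictions):
    """Count rows for each (slice membership, predicted label) pair."""
    if len(in_slice) != len(predictions):
        raise ValueError("both sequences must have the same length")
    return Counter(zip(in_slice, predictions))


# Hypothetical data: whether each row belongs to the slice, and its prediction.
in_slice = [True, True, True, False, False, False]
predictions = ["default", "default", "non default",
               "non default", "non default", "default"]

table = contingency_table(in_slice, predictions)
```

The resulting counts are the input that an association measure such as Cramer's V, mutual information or Theil's U summarizes into a single score.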
- giskard.testing.test_cramer_v(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: Cramer’s V is a statistical measure used to assess the strength and nature of association between two categorical variables. It is an extension of the chi-squared test for independence and takes into account the dimensions of the contingency table. Cramer’s V ranges from 0 to 1, where 0 indicates no association and 1 indicates a perfect association.
Description: Cramer’s V is particularly useful for analyzing nominal data and understanding the relationship between categorical variables. It’s a normalized version of the chi-squared statistic that considers the dimensions of the contingency table. The formula adjusts for the number of observations and the number of categories in the variables to provide a more interpretable measure of association. Mathematically, the Cramer’s V metric can be expressed as:
\[V = \sqrt{\frac{\chi^2}{n \cdot \min(k-1, r-1)}}\]
where:
\(\chi^2\) is the chi-squared statistic for the two variables.
\(n\) is the total number of observations.
\(k\) is the number of categories in one variable.
\(r\) is the number of categories in the other variable.
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
slicing_function (SlicingFunction) – Slicing function to be applied on the dataset
threshold (float) – Threshold value for the Cramer’s V score
debug (bool) – If True and the test fails, a dataset will be provided containing the rows of the dataset slice.
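The Cramer's V formula above can be worked through in pure Python. This is a minimal sketch of the uncorrected statistic, with no continuity or bias correction, so it may not match Giskard's output digit for digit; the function name `cramers_v` is hypothetical.

```python
import math
from collections import Counter


def cramers_v(xs, ys):
    """Uncorrected Cramer's V between two equally long categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    # Chi-squared statistic: sum over all cells of (observed - expected)^2 / expected.
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected

    k, r = len(x_counts), len(y_counts)
    denominator = n * min(k - 1, r - 1)
    if denominator == 0:
        # A variable with a single category carries no association (convention).
        return 0.0
    return math.sqrt(chi2 / denominator)


perfect = cramers_v(["a", "a", "b", "b"], [1, 1, 2, 2])
independent = cramers_v(["a", "a", "b", "b"], [1, 2, 1, 2])
```

The two calls illustrate the ends of the scale: the first pair of variables determine each other completely, the second pair are independent.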
- giskard.testing.test_mutual_information(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: The mutual information statistical test is a measure used to quantify the degree of association between two categorical variables. It assesses how much information about one variable can be gained from knowing the other variable’s value. Mutual information is based on the concept of entropy and provides a way to determine the level of dependency or correlation between categorical variables.
Description: Mutual information measures the reduction in uncertainty about one variable given knowledge of the other variable. It takes into account both individual and joint distributions of the variables and provides a value indicating how much information is shared between them. Higher mutual information values suggest stronger association, while lower values indicate weaker or no association. Mathematically, the mutual information metric can be expressed as:
\[I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \cdot \log\left(\frac{p(x, y)}{p(x) \cdot p(y)}\right)\]
where:
\(p(x,y)\) is the joint probability mass function of variables \(X\) and \(Y\).
\(p(x)\) and \(p(y)\) are the marginal probability mass functions of variables \(X\) and \(Y\) respectively.
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
slicing_function (SlicingFunction) – Slicing function to be applied on the dataset
threshold (float) – Threshold value for the mutual information score
debug (bool) – If True and the test fails, a dataset will be provided containing the rows of the dataset slice.
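The mutual information sum above translates directly into code when the probabilities are estimated from observed frequencies. This sketch uses the natural logarithm, which is an assumption since the formula does not fix the base; the function name `mutual_information` is hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long categorical sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    x_counts = Counter(xs)
    y_counts = Counter(ys)

    mi = 0.0
    # Only observed pairs contribute; cells with p(x, y) = 0 add nothing to the sum.
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        mi += p_xy * math.log(p_xy / (p_x * p_y))
    return mi


shared = mutual_information(["a", "a", "b", "b"], [1, 1, 2, 2])
independent = mutual_information(["a", "a", "b", "b"], [1, 2, 1, 2])
```

Unlike Cramer's V, the raw value is not bounded by 1: for two balanced binary variables that determine each other, it equals log 2 in nats.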
- giskard.testing.test_theil_u(model: SuiteInput | BaseModel | None = None, dataset: SuiteInput | Dataset | None = None, slicing_function: SuiteInput | SlicingFunction | None = None, threshold: SuiteInput | float | None = 0.5, debug: SuiteInput | bool | None = False) -> GiskardTestMethod
Summary: Theil’s U statistical test for nominal association is a measure used to assess the strength and direction of association between two categorical variables. It quantifies the inequality in the distribution of one variable relative to the distribution of the other variable, providing insights into the pattern of association between them. Theil’s U ranges from 0 to 1, where 0 indicates no association, and 1 indicates a perfect association.
Description: Theil’s U for nominal association is commonly used to analyze the relationships between variables like ethnicity, gender, or occupation. It considers the proportions of one variable’s categories within each category of the other variable. The calculation involves comparing the observed joint distribution of the two variables with what would be expected if there were no association. Mathematically, Theil’s U for nominal association can be expressed as:
\[U(x|y) = \frac{H(x) - H(x|y)}{H(x)}\]
where \(H(x)\) is the entropy of the first variable and \(H(x|y)\) is its conditional entropy given the second variable.
- Parameters:
model (BaseModel) – Model used to compute the test
dataset (Dataset) – Dataset used to compute the test
slicing_function (SlicingFunction) – Slicing function to be applied on the dataset
threshold (float) – Threshold value for the Theil’s U score
debug (bool) – If True and the test fails, a dataset will be provided containing the rows of the dataset slice.
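Theil's U (the uncertainty coefficient) is the fraction of the entropy of one variable that is explained by the other, (H(x) - H(x|y)) / H(x). A minimal sketch under that definition follows; the function names are hypothetical, and returning 1.0 when the first variable has zero entropy is a convention chosen for this example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy H(x) in nats, estimated from observed frequencies."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def conditional_entropy(xs, ys):
    """H(x|y) = -sum over (x, y) of p(x, y) * log p(x|y)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    y_counts = Counter(ys)
    h = 0.0
    for (x, y), count in joint.items():
        h -= (count / n) * math.log(count / y_counts[y])
    return h


def theils_u(xs, ys):
    """Uncertainty coefficient U(x|y): share of H(x) explained by y."""
    h_x = entropy(xs)
    if h_x == 0:
        return 1.0  # x is constant, so nothing is left to explain (convention)
    return (h_x - conditional_entropy(xs, ys)) / h_x


coarse = ["a", "a", "b", "b"]
fine = [1, 2, 3, 4]

u_coarse_given_fine = theils_u(coarse, fine)  # fine fully determines coarse
u_fine_given_coarse = theils_u(fine, coarse)  # coarse explains only half of fine
```

The two calls give different values, which shows that the measure is asymmetric: knowing `fine` removes all uncertainty about `coarse`, while knowing `coarse` leaves half of the entropy of `fine` unexplained.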