Data quality tests

giskard.testing.test_data_uniqueness(dataset: SuiteInput | Dataset | None = None, column: SuiteInput | str | None = None, threshold: SuiteInput | float | None = 0.8) → GiskardTestMethod

Test for checking the uniqueness of data in a column.

Parameters:
  • dataset (Dataset) – The dataset to test.

  • column (str) – The column to check for uniqueness.

  • threshold (float, optional) – The minimum uniqueness ratio for the test to pass, by default 0.8

Returns:

The result of the test.

Return type:

TestResult
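
To make the threshold concrete, here is a minimal pure-Python sketch of a uniqueness ratio. It assumes the ratio is the number of distinct values divided by the number of rows and that the test passes when the ratio is at least the threshold; this is an illustration, not giskard's implementation, and the helper names and sample data are hypothetical.

```python
from typing import Hashable, Sequence


def uniqueness_ratio(values: Sequence[Hashable]) -> float:
    """Share of distinct values among all rows (assumed definition)."""
    if not values:
        raise ValueError("column is empty")
    return len(set(values)) / len(values)


def passes_uniqueness(values: Sequence[Hashable], threshold: float = 0.8) -> bool:
    """Assumed pass rule: the ratio must reach the threshold."""
    return uniqueness_ratio(values) >= threshold


# Hypothetical column: 4 distinct values over 5 rows.
customer_ids = [101, 102, 103, 103, 104]
print(uniqueness_ratio(customer_ids), passes_uniqueness(customer_ids))
```

Under these assumptions, a column with one duplicated value in five rows sits exactly on the default threshold of 0.8.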

giskard.testing.test_data_completeness(dataset: SuiteInput | Dataset | None = None, column_name: SuiteInput | str | None = None, threshold: SuiteInput | float | None = None) → GiskardTestMethod

Test for checking the completeness of data in a dataset.

Parameters:
  • dataset (Dataset) – The dataset to test.

  • column_name (str) – The name of the column to test.

  • threshold (float) – The minimum completeness ratio for the test to pass.

Returns:

A TestResult object indicating whether the test passed and the completeness ratio.

Return type:

TestResult
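
A completeness ratio can be sketched as the share of non-missing entries in a column. The snippet below assumes that `None` and floating-point NaN count as missing and that the test passes when the ratio reaches the threshold; the helper names are hypothetical and the code is not taken from giskard.

```python
import math
from typing import Any, Sequence


def is_missing(value: Any) -> bool:
    """Treat None and float NaN as missing (an assumption of this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def completeness_ratio(values: Sequence[Any]) -> float:
    """Share of non-missing entries among all rows."""
    if not values:
        raise ValueError("column is empty")
    present = sum(1 for v in values if not is_missing(v))
    return present / len(values)


# Hypothetical column with two missing entries out of four.
ages = [34.0, None, 51.0, float("nan")]
print(completeness_ratio(ages))
```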

giskard.testing.test_valid_range(dataset: SuiteInput | Dataset | None = None, column: SuiteInput | str | None = None, min_value: SuiteInput | float | None = None, max_value: SuiteInput | float | None = None) → GiskardTestMethod

Test for checking if data in a column falls within a specified range.

Parameters:
  • dataset (Dataset) – The dataset to test

  • column (str) – The column to check

  • min_value (float, optional) – The minimum valid value, by default None

  • max_value (float, optional) – The maximum valid value, by default None

Returns:

The result of the test

Return type:

TestResult
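
The range check can be illustrated with a short pure-Python sketch. It assumes both bounds are inclusive and that a bound left as `None` is not enforced; whether giskard treats the bounds the same way is not stated above, so read this as an illustration with hypothetical names.

```python
from typing import Optional, Sequence


def in_valid_range(
    values: Sequence[float],
    min_value: Optional[float] = None,
    max_value: Optional[float] = None,
) -> bool:
    """Return True when every value lies within the given bounds.

    Bounds are inclusive, and a bound of None is skipped (assumptions).
    """
    for v in values:
        if min_value is not None and v < min_value:
            return False
        if max_value is not None and v > max_value:
            return False
    return True


# Hypothetical column of percentages.
scores = [12.5, 40.0, 99.9]
print(in_valid_range(scores, min_value=0, max_value=100))
```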

giskard.testing.test_valid_values(dataset: SuiteInput | Dataset | None = None, column: SuiteInput | str | None = None, valid_values: SuiteInput | List | None = None) → GiskardTestMethod

Test for checking if data in a column is in a set of valid values.

Parameters:
  • dataset (Dataset) – The dataset to test

  • column (str) – The column to check

  • valid_values (Optional[List], optional) – A list of valid values, by default None

Returns:

The result of the test

Return type:

TestResult
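
A membership check of this kind can be sketched as follows. The helper below is hypothetical and simply collects the entries that fall outside the allowed set; the check is taken to pass when nothing is collected, which is an assumption rather than documented giskard behavior.

```python
from typing import Hashable, Iterable, List, Sequence


def invalid_entries(values: Sequence[Hashable], valid_values: Iterable[Hashable]) -> List[Hashable]:
    """Return the entries that are not members of valid_values, in order."""
    allowed = set(valid_values)
    return [v for v in values if v not in allowed]


# Hypothetical categorical column with one unexpected value.
statuses = ["open", "closed", "pending", "open"]
bad = invalid_entries(statuses, ["open", "closed"])
print(bad, len(bad) == 0)
```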

giskard.testing.test_data_correlation(dataset: SuiteInput | Dataset | None = None, column1: SuiteInput | str | None = None, column2: SuiteInput | str | None = None, should_correlate: SuiteInput | bool | None = True, correlation_threshold: SuiteInput | float | None = 0.0) → GiskardTestMethod

Test for analyzing correlations between two specific features.

Parameters:
  • dataset (Dataset) – The dataset to test

  • column1 (str, optional) – The first column to check, by default None

  • column2 (str, optional) – The second column to check, by default None

  • should_correlate (bool, optional) – Whether the two columns should correlate, by default True

  • correlation_threshold (float, optional) – The minimum absolute correlation that is considered significant, by default 0.0

Returns:

The result of the test, containing the correlation between the two columns

Return type:

TestResult
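
How `should_correlate` and `correlation_threshold` might interact can be sketched with a hand-written Pearson coefficient. The correlation measure and the pass rule below are assumptions made for illustration (the reference above names neither), and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two columns of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def correlation_check(xs, ys, should_correlate=True, correlation_threshold=0.0) -> bool:
    """Assumed rule: compare |r| with the threshold in the expected direction."""
    strength = abs(pearson(xs, ys))
    if should_correlate:
        return strength >= correlation_threshold
    return strength < correlation_threshold


# Hypothetical columns with a perfectly linear relationship.
height_cm = [150.0, 160.0, 170.0, 180.0]
height_in = [h / 2.54 for h in height_cm]
print(correlation_check(height_cm, height_in, should_correlate=True, correlation_threshold=0.9))
```

Note that under this assumed rule, `should_correlate=False` with the default threshold of 0.0 can never pass, so a positive threshold would be needed in that case.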

giskard.testing.test_outlier_value(dataset: SuiteInput | Dataset | None = None, column: SuiteInput | str | None = None, eps: SuiteInput | float | None = 0.5, min_samples: SuiteInput | int | None = 5) → GiskardTestMethod

Test for identifying outliers or anomalies in a column of the dataset using DBSCAN.

Parameters:
  • dataset (Dataset) – The dataset to test

  • column (str) – The column to check for anomalies

  • eps (float, optional) – The maximum distance between two samples for one to be considered as in the neighborhood of the other, by default 0.5

  • min_samples (int, optional) – The number of samples in a neighborhood for a point to be considered as a core point, by default 5

Returns:

The result of the test, containing the indices of the anomalies

Return type:

TestResult
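
The roles of `eps` and `min_samples` can be illustrated with a small pure-Python sketch of the DBSCAN noise rule on one-dimensional data. It is a brute-force illustration with a hypothetical function name, not the library code giskard relies on, and it assumes a point counts as its own neighbor.

```python
from typing import List, Sequence


def dbscan_noise_indices(values: Sequence[float], eps: float = 0.5, min_samples: int = 5) -> List[int]:
    """Return the indices that DBSCAN would label as noise for 1-D data."""
    n = len(values)
    # Neighborhood of each point: all points within eps, the point itself included.
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(nb) >= min_samples for nb in neighbors]
    # Noise: neither a core point nor within eps of any core point.
    return [
        i
        for i in range(n)
        if not is_core[i] and not any(is_core[j] for j in neighbors[i])
    ]


# Hypothetical sensor readings: five close values and one far away.
readings = [1.0, 1.1, 1.2, 1.3, 1.4, 10.0]
print(dbscan_noise_indices(readings))
```

The quadratic neighbor search keeps the sketch short; it is only suitable for small columns.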

giskard.testing.test_foreign_constraint(dataset: SuiteInput | Dataset | None = None, column: SuiteInput | str | None = None, target_dataset: SuiteInput | Dataset | None = None, target_column: SuiteInput | str | None = None, threshold: SuiteInput | float | None = 0.0) → GiskardTestMethod

Ensure that all data in a column of one dataset are present in a column of another dataset.

Parameters:
  • dataset (Dataset) – The dataset to check

  • column (str) – The column in the dataset to check

  • target_dataset (Dataset) – The dataset to compare against

  • target_column (str) – The column in the target dataset to compare against

  • threshold (float, optional) – The maximum allowed ratio of missing values, by default 0.0

Returns:

The result of the test, indicating whether the test passed and the ratio of missing values

Return type:

TestResult
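
The ratio of missing values can be sketched as the share of rows whose value does not appear in the target column. The snippet assumes the ratio is computed over rows (not over distinct values) and that the test passes when the ratio does not exceed the threshold; the names and data are hypothetical.

```python
from typing import Hashable, Sequence


def missing_ratio(values: Sequence[Hashable], target_values: Sequence[Hashable]) -> float:
    """Share of rows whose value is absent from the target column."""
    if not values:
        raise ValueError("column is empty")
    allowed = set(target_values)
    missing = sum(1 for v in values if v not in allowed)
    return missing / len(values)


def passes_foreign_constraint(values, target_values, threshold: float = 0.0) -> bool:
    """Assumed pass rule: the missing ratio must not exceed the threshold."""
    return missing_ratio(values, target_values) <= threshold


# Hypothetical order rows referencing customer ids; id 5 has no matching customer.
order_customer_ids = [1, 2, 5, 2]
customer_ids = [1, 2, 3]
print(missing_ratio(order_customer_ids, customer_ids))
```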

giskard.testing.test_label_consistency(dataset: SuiteInput | Dataset | None = None, label_column: SuiteInput | str | None = None) → GiskardTestMethod

Test for checking that data types are consistent for each label throughout the dataset.

Parameters:
  • dataset (Dataset) – The dataset to test

  • label_column (str) – The column containing the labels

Returns:

The result of the test

Return type:

TestResult

giskard.testing.test_mislabeling(dataset: SuiteInput | Dataset | None = None, labelled_column: SuiteInput | str | None = None, reference_columns: SuiteInput | Iterable[str] | None = None) → GiskardTestMethod

Test for detecting mislabelled data

Parameters:
  • dataset (Dataset) – The dataset to test

  • labelled_column (str) – The column containing the labels

  • reference_columns (Iterable[str]) – The columns containing the data to check for consistency

Returns:

The result of the test, containing the indices of the mislabelled data

Return type:

TestResult

giskard.testing.test_feature_importance(dataset: SuiteInput | Dataset | None = None, feature_columns: SuiteInput | Iterable[str] | None = None, target_column: SuiteInput | str | None = None, importance_threshold: SuiteInput | float | None = 0) → GiskardTestMethod

Test for analyzing the importance of features in a classification problem

Parameters:
  • dataset (Dataset) – The dataset to test

  • feature_columns (Iterable[str]) – The columns containing the features

  • target_column (str) – The column containing the target variable

  • importance_threshold (float, optional) – The minimum importance that is considered significant, by default 0

Returns:

The result of the test, containing the feature importances

Return type:

TestResult

giskard.testing.test_class_imbalance(dataset: SuiteInput | Dataset | None = None, target_column: SuiteInput | str | None = None, lower_threshold: SuiteInput | float | None = None, upper_threshold: SuiteInput | float | None = None) → GiskardTestMethod

Test for assessing the distribution of classes in classification problems.

Parameters:
  • dataset (Dataset) – The dataset to test

  • target_column (str) – The column containing the target variable

  • lower_threshold (float) – The minimum allowed class proportion

  • upper_threshold (float) – The maximum allowed class proportion

Returns:

The result of the test, containing the class proportions

Return type:

TestResult
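
Class proportions and the two thresholds can be illustrated with a short pure-Python sketch. It assumes the test passes only when every class proportion lies between `lower_threshold` and `upper_threshold` inclusive; the helper names and labels are hypothetical and the code is not giskard's implementation.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def class_proportions(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Map each class to its share of the rows."""
    if not labels:
        raise ValueError("target column is empty")
    total = len(labels)
    return {label: count / total for label, count in Counter(labels).items()}


def is_balanced(labels: Sequence[Hashable], lower_threshold: float, upper_threshold: float) -> bool:
    """Assumed pass rule: every proportion lies within the inclusive bounds."""
    return all(
        lower_threshold <= p <= upper_threshold
        for p in class_proportions(labels).values()
    )


# Hypothetical binary target: three "spam" rows and one "ham" row.
target = ["spam", "spam", "spam", "ham"]
print(class_proportions(target), is_balanced(target, 0.2, 0.8))
```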