Data quality tests

Test for checking the uniqueness of data in a column.

Parameters:

dataset (Dataset) – The dataset to test.
column (str) – The column to check for uniqueness.
threshold (float, optional) – The minimum uniqueness ratio for the test to pass, by default 0.8

Returns:

The result of the test.

Return type:

TestResult
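To make the threshold concrete, here is a minimal sketch of how a uniqueness ratio could be computed. It is not the library's implementation: the helper names are hypothetical, a plain list stands in for a Dataset column, and both the ratio definition (distinct values divided by row count) and the pass rule (ratio at or above the threshold) are assumptions.

```python
def uniqueness_ratio(values):
    """Share of distinct values among all values (0.0 for empty input)."""
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def passes_uniqueness(values, threshold=0.8):
    # Assumed rule: the test passes when the ratio reaches the threshold.
    return uniqueness_ratio(values) >= threshold


ids = ["a", "b", "c", "c", "d"]
ratio = uniqueness_ratio(ids)  # 4 distinct values out of 5 rows
```

Under these assumptions a column with one duplicated value in five rows sits exactly at the default threshold of 0.8.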

Test for checking the completeness of data in a dataset.

Parameters:

dataset (Dataset) – The dataset to test.
column_name (str) – The name of the column to test.
threshold (float) – The minimum completeness ratio for the test to pass.

Returns:

A TestResult object indicating whether the test passed and the completeness ratio.

Return type:

TestResult
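The completeness ratio can be illustrated with a short sketch. This is only an assumption about what the ratio means (non-missing entries divided by row count); the helper names are hypothetical and a plain list stands in for a Dataset column.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing (an assumption for this sketch)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def completeness_ratio(values):
    """Share of entries that are not missing (0.0 for empty input)."""
    if not values:
        return 0.0
    present = sum(1 for v in values if not is_missing(v))
    return present / len(values)


def passes_completeness(values, threshold):
    # Assumed rule: the test passes when the ratio reaches the threshold.
    return completeness_ratio(values) >= threshold


ages = [31, None, 47, float("nan")]
ratio = completeness_ratio(ages)  # 2 present values out of 4 rows
```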

Test for checking if data in a column falls within a specified range.

Parameters:

dataset (Dataset) – The dataset to test
column (str) – The column to check
min_value (float, optional) – The minimum valid value, by default None
max_value (float, optional) – The maximum valid value, by default None

Returns:

The result of the test

Return type:

TestResult
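A range check of this kind might look like the sketch below. It assumes inclusive bounds and treats a bound of None as "no limit on that side"; neither detail is confirmed by the reference above, and the helper name is hypothetical.

```python
def out_of_range_indices(values, min_value=None, max_value=None):
    """Return the positions of values outside [min_value, max_value]."""
    bad = []
    for index, value in enumerate(values):
        if min_value is not None and value < min_value:
            bad.append(index)
        elif max_value is not None and value > max_value:
            bad.append(index)
    return bad


temperatures = [5, 12, -1, 7]
violations = out_of_range_indices(temperatures, min_value=0, max_value=10)
```

In this sketch the check would pass only when the returned list is empty.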

Test for checking if data in a column is in a set of valid values.

Parameters:

dataset (Dataset) – The dataset to test
column (str) – The column to check
valid_values (Optional[List], optional) – A list of valid values, by default None

Returns:

The result of the test

Return type:

TestResult
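As an illustration of a membership check, the sketch below collects the positions of values that are not in the allowed set. The helper name is hypothetical, and returning no violations when valid_values is None is an assumption about the default, not documented behavior.

```python
def invalid_value_indices(values, valid_values=None):
    """Return the positions of values that are not in valid_values."""
    if valid_values is None:
        # Assumption: with no allowed set given, there is nothing to check.
        return []
    allowed = set(valid_values)
    return [index for index, value in enumerate(values) if value not in allowed]


colors = ["red", "green", "purple"]
violations = invalid_value_indices(colors, valid_values=["red", "green", "blue"])
```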

Test for analyzing correlations between two specific features.

Parameters:

dataset (Dataset) – The dataset to test
column1 (str, optional) – The first column to check, by default None
column2 (str, optional) – The second column to check, by default None
should_correlate (bool, optional) – Whether the two columns should correlate, by default True
correlation_threshold (float, optional) – The minimum absolute correlation that is considered significant, by default 0.0

Returns:

The result of the test, containing the correlation between the two columns

Return type:

TestResult
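The interplay of should_correlate and correlation_threshold can be sketched as follows. The sketch uses the Pearson coefficient and a simple pass rule; both are assumptions, since the reference above does not state which correlation measure or comparison is used. The helper names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)


def passes_correlation(xs, ys, should_correlate=True, correlation_threshold=0.0):
    strength = abs(pearson(xs, ys))
    if should_correlate:
        # Assumed rule: correlated columns must reach the threshold.
        return strength >= correlation_threshold
    # Assumed rule: uncorrelated columns must stay below it.
    return strength < correlation_threshold


height = [1, 2, 3, 4]
weight = [2, 4, 6, 8]
```

With the default threshold of 0.0, the should_correlate=True branch of this sketch always passes, so a non-zero threshold is needed for the check to be informative.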

Test for identifying outliers or anomalies in a column of the dataset using DBSCAN.

Parameters:

dataset (Dataset) – The dataset to test
column (str) – The column to check for anomalies
eps (float, optional) – The maximum distance between two samples for one to be considered as in the neighborhood of the other, by default 0.5
min_samples (int, optional) – The number of samples in a neighborhood for a point to be considered as a core point, by default 5

Returns:

The result of the test, containing the indices of the anomalies

Return type:

TestResult
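To show what eps and min_samples control, here is a simplified, pure-Python sketch of DBSCAN's noise rule on a single numeric column. It is not the library's code and skips cluster labelling entirely: a point is a core point if at least min_samples points (itself included) lie within eps of it, and a point is reported as an anomaly if it is neither a core point nor within eps of one. The function name is hypothetical.

```python
def dbscan_noise_indices(values, eps=0.5, min_samples=5):
    """Return positions of points that DBSCAN would label as noise (1-D case)."""
    n = len(values)
    # Neighborhood of each point: all points within eps, the point itself included.
    neighbors = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    core = {i for i in range(n) if len(neighbors[i]) >= min_samples}
    # Noise: not a core point and not reachable from any core point.
    return [
        i for i in range(n)
        if i not in core and not any(j in core for j in neighbors[i])
    ]


readings = [1.0, 1.1, 1.2, 1.3, 1.4, 9.0]
anomalies = dbscan_noise_indices(readings)
```

The pairwise comparison is quadratic in the number of rows, which is fine for a demonstration but not for large columns.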

Ensure that all data in a column of one dataset are present in a column of another dataset.

Parameters:

dataset (Dataset) – The dataset to check
column (str) – The column in the dataset to check
target_dataset (Dataset) – The dataset to compare against
target_column (str) – The column in the target dataset to compare against
threshold (float, optional) – The maximum allowed ratio of missing values, by default 0.0

Returns:

The result of the test, indicating whether the test passed and the ratio of missing values

Return type:

TestResult
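The ratio of missing values in this check can be pictured with the sketch below. It assumes the ratio is the share of rows in the checked column whose value does not appear anywhere in the target column, and that the test passes when that ratio does not exceed the threshold. The helper names are hypothetical and plain lists stand in for the two columns.

```python
def missing_ratio(values, target_values):
    """Share of values that do not appear in target_values (0.0 for empty input)."""
    if not values:
        return 0.0
    known = set(target_values)
    missing = sum(1 for value in values if value not in known)
    return missing / len(values)


def passes_foreign_constraint(values, target_values, threshold=0.0):
    # Assumed rule: the ratio of missing values must not exceed the threshold.
    return missing_ratio(values, target_values) <= threshold


order_customer_ids = [1, 2, 3, 4]
customer_ids = [1, 2, 3]
ratio = missing_ratio(order_customer_ids, customer_ids)  # one of four ids is unknown
```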

giskard.testing.test_label_consistency(dataset: SuiteInput | Dataset | None = None, label_column: SuiteInput | str | None = None) → GiskardTestMethod

Test for checking that data types are consistent for each label throughout the dataset.

Parameters:

dataset (Dataset) – The dataset to test
label_column (str) – The column containing the labels

Returns:

The result of the test

Return type:

TestResult
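One possible reading of this check is sketched below: group rows by label and flag every column that holds more than one Python type within the same label. This interpretation is an assumption, the function name is hypothetical, and a list of dictionaries stands in for the Dataset.

```python
from collections import defaultdict


def inconsistent_types(rows, label_column):
    """Map (label, column) to the type names seen, keeping only mixed-type cases."""
    seen = defaultdict(set)
    for row in rows:
        label = row[label_column]
        for column, value in row.items():
            if column != label_column:
                seen[(label, column)].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items() if len(types) > 1}


rows = [
    {"label": "cat", "weight": 4.2},
    {"label": "cat", "weight": "heavy"},  # a string where a float is expected
    {"label": "dog", "weight": 30.0},
]
problems = inconsistent_types(rows, "label")
```

Under this reading the check would pass only when the returned mapping is empty.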

Test for detecting mislabelled data.

Parameters:

dataset (Dataset) – The dataset to test
labelled_column (str) – The column containing the labels
reference_columns (Iterable[str]) – The columns containing the data to check for consistency

Returns:

The result of the test, containing the indices of the mislabelled data

Return type:

TestResult

Test for analyzing the importance of features in a classification problem.

Parameters:

dataset (Dataset) – The dataset to test
feature_columns (Iterable[str]) – The columns containing the features
target_column (str) – The column containing the target variable
importance_threshold (float, optional) – The minimum importance that is considered significant, by default 0

Returns:

The result of the test, containing the feature importances

Return type:

TestResult

Test for assessing the distribution of classes in classification problems.

Parameters:

dataset (Dataset) – The dataset to test
target_column (str) – The column containing the target variable
lower_threshold (float) – The minimum allowed class proportion
upper_threshold (float) – The maximum allowed class proportion

Returns:

The result of the test, containing the class proportions

Return type:

TestResult
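The class proportions and the two thresholds can be illustrated with a short sketch. It assumes every class proportion must lie within [lower_threshold, upper_threshold], bounds included; the helper names are hypothetical and a plain list stands in for the target column.

```python
from collections import Counter


def class_proportions(labels):
    """Return each class's share of the rows."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


def passes_class_balance(labels, lower_threshold, upper_threshold):
    # Assumed rule: every class proportion must stay within the two bounds.
    return all(
        lower_threshold <= share <= upper_threshold
        for share in class_proportions(labels).values()
    )


targets = ["a"] * 8 + ["b"] * 2
shares = class_proportions(targets)  # "a" covers 8 of 10 rows, "b" covers 2
```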