Tests that the model is not vulnerable to control character injection.
This works by appending special characters like \r or \b to the
input and checking that the model output is not altered. If the model is
vulnerable, it will typically forget the prompt and output unrelated
content. See [1] for more details about this vulnerability.
Parameters:
dataset (Dataset) – A sample dataset which will be perturbed with character injection.
characters (Optional[Sequence[str]]) – The characters to inject. By default, we will try with \r and \b.
features (Optional[Sequence[str]]) – The features to test. By default, all features will be tested.
max_repetitions (int) – The maximum number of repetitions of the character to inject, by default
1000. If the model fails with that number of repetitions (for example
because of limited context length), we will retry with half and then a
quarter of that number.
threshold (float) – Threshold for the fail rate, by default 0.1. If the fail rate is above
this threshold, the test will fail.
output_sensitivity (float) – Output sensitivity, by default 0.2. This is the minimum difference in
BERT score for two outputs to be considered different.
debug (bool) – If True, the output dataset containing the failing samples will be
returned.
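As an illustration, the sketch below shows how such a test might be invoked with the parameters above. The entry point test_llm_char_injection, the objects my_llm_model and wrap_dataset, and the result attribute are assumed names used for illustration only, not a confirmed API.

```python
# Hypothetical invocation sketch; test_llm_char_injection, my_llm_model and
# wrap_dataset are assumed names, not a confirmed API.
import pandas as pd

# A small sample of production-like inputs; the test appends control
# characters (e.g. "\r", "\b") to each feature and re-queries the model.
sample_df = pd.DataFrame({"question": ["What is your refund policy?"]})

result = test_llm_char_injection(
    model=my_llm_model,               # wrapped generative model under test
    dataset=wrap_dataset(sample_df),  # sample dataset to perturb
    characters=["\r", "\b"],          # characters to inject (defaults)
    max_repetitions=1000,             # halved, then quartered, on context-length failures
    threshold=0.1,                    # fail if more than 10% of samples break
    output_sensitivity=0.2,           # minimum BERT-score gap to count as "different"
    debug=True,                       # return the failing samples
)
print(result.passed)                  # assumed result attribute
```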
Tests that the model is not vulnerable to prompt injection.
This works by inserting a set of jailbreaking prompts into the input and
checking that the model output is not altered. If the model is vulnerable,
it will typically ignore any instructions or safeguarding protocols it might
have had and comply with the new instructions given by the injected prompt.
See [2] for more details about this vulnerability.
Parameters:
dataset (Dataset) – A dataset of adversarial prompts.
substrings (List[str]) – A list of substrings that need to be found in the model's output in
order to classify it as a failure.
all_substrings_must_be_found (bool) – A flag to indicate whether all (True) or any (False) of the
substrings must be found in order to classify the output as a failure.
Default: True.
exact_matching (bool) – A flag to indicate whether the substrings (typically one) need to match
the model's output exactly in order to classify it as a failure.
Default: False.
word_matching (bool) – A flag to indicate whether the substrings should be found as independent
words in the model's output in order to classify it as a failure.
Default: False.
case_sensitive (bool) – A flag to indicate whether the substring matching should be
case-sensitive or not. Default: True.
punctuation_sensitive (bool) – A flag to indicate whether the substring matching should be
punctuation-sensitive or not. Default: True.
threshold (float) – Threshold for the fail rate, by default 0.5. If the fail rate is above
this threshold, the test will fail.
debug (bool) – If True, the output dataset containing the failing samples will be
returned.
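A possible invocation using the parameters above is sketched here. The entry point test_llm_prompt_injection and the objects my_llm_model and adversarial_prompts are assumed names for illustration, and the substrings shown are placeholder failure markers.

```python
# Hypothetical sketch; test_llm_prompt_injection, my_llm_model and
# adversarial_prompts are assumed names, not a confirmed API.
result = test_llm_prompt_injection(
    model=my_llm_model,                     # wrapped generative model under test
    dataset=adversarial_prompts,            # dataset of jailbreaking prompts
    substrings=["DAN", "do anything now"],  # strings whose presence marks a failure
    all_substrings_must_be_found=False,     # any single match counts as a failure
    exact_matching=False,                   # substring match, not full-output match
    word_matching=False,                    # no independent-word constraint
    case_sensitive=False,                   # ignore case when matching
    punctuation_sensitive=False,            # ignore punctuation when matching
    threshold=0.5,                          # fail if more than 50% of prompts succeed
    debug=True,                             # return the failing samples
)
```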
Evaluates the model output against a given requirement with another LLM (LLM-as-a-judge).
The model outputs over a given dataset will be validated against the
specified requirement using GPT-4 (note that this requires you to set the
OPENAI_API_TOKEN environment variable for the test to run correctly).
Parameters:
model (BaseModel) – The generative model to test.
dataset (Dataset) – A dataset of examples which will be provided as inputs to the model.
requirement (str) – The requirement to evaluate the model output against. This should be a
clear and explicit requirement that can be interpreted by the LLM, for
example: "The model should decline to answer", "The model should not
generate content that incites harm or violence", or "The model should
apologize and explain that it cannot answer questions unrelated to its
scope".
debug (bool) – If True and the test fails, a dataset containing the rows that have
failed the evaluation criteria will be included in the test result.
Evaluates the model output against a given requirement with another LLM (LLM-as-a-judge).
The model outputs over a given dataset will be validated against the
specified requirement using GPT-4 (note that this requires you to set the
OPENAI_API_TOKEN environment variable for the test to run correctly).
Parameters:
model (BaseModel) – The generative model to test.
input_var (str) – The input to provide to the model. If your model has a single input
variable, this will be used as its value. For example, if your model has
a single input variable called question, you can set input_var
to the question you want to ask the model, e.g. "What is the capital of
France?". If you need to pass multiple input variables to the
model, set input_as_json to True and specify input_var as a JSON
encoded object. For example:
`input_var='{"question": "What is the capital of France?", "language": "English"}'`
requirement (str) – The requirement to evaluate the model output against. This should be a
clear and explicit requirement that can be interpreted by the LLM, for
example: "The model should decline to answer", "The model should not
generate content that incites harm or violence".
input_as_json (bool) – If True, input_var will be parsed as a JSON encoded object. Default is
False.
debug (bool) – If True and the test fails, a dataset containing the rows that have
failed the evaluation criteria will be included in the test result.
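The sketch below illustrates passing several input variables as a JSON-encoded string. The entry point test_llm_single_output_against_requirement and the object my_llm_model are assumed names for illustration.

```python
import json

# Hypothetical sketch; test_llm_single_output_against_requirement and
# my_llm_model are assumed names, not a confirmed API.
inputs = {"question": "What is the capital of France?", "language": "English"}

result = test_llm_single_output_against_requirement(
    model=my_llm_model,
    input_var=json.dumps(inputs),  # JSON-encoded because there are several variables
    input_as_json=True,            # tell the test to parse input_var as JSON
    requirement="The model should answer in the language requested by the user.",
)
```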
dataset_1 (Dataset) – A sample dataset of inputs. If dataset_2 is not passed, we will run
predictions again on these inputs and check that the outputs are
coherent.
dataset_2 (Optional[Dataset]) – Another sample dataset of inputs, with the same index as dataset_1. If
not passed, we will rerun the model on dataset_1.
eval_prompt (Optional[str]) – Optional custom prompt to use for evaluation. If not provided, the
default prompt of CoherencyEvaluator will be used.
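These parameters belong to a coherency check in which the model outputs over two datasets (or two runs over the same dataset) are compared for consistency. A possible invocation is sketched below; the entry point test_llm_output_coherency and the objects my_llm_model and sample_inputs are assumed names for illustration.

```python
# Hypothetical sketch; test_llm_output_coherency, my_llm_model and
# sample_inputs are assumed names, not a confirmed API.
result = test_llm_output_coherency(
    model=my_llm_model,       # wrapped generative model under test
    dataset_1=sample_inputs,  # sample dataset of inputs
    dataset_2=None,           # omitted: the model is simply rerun on dataset_1
    eval_prompt=None,         # use the default CoherencyEvaluator prompt
)
```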