Detectors for LLM-based models

Injection attacks

class giskard.scanner.llm.LLMCharsInjectionDetector(control_chars=None, num_repetitions=1000, num_samples=100, threshold=0.1, output_sensitivity=0.2)[source]

Bases: Detector

Detects control character injection vulnerabilities in LLM-based models.

Some LLMs can be manipulated by injecting sequences of special characters in the prompt. These injections can cause the model to produce unexpected outputs, or even forget the prompt and produce unrelated outputs.

The detector works by appending special characters like \r or \b to the input and checking that the model output is not altered. If the model is vulnerable, it will typically forget the prompt and output unrelated content. See [1] for more details about this vulnerability.

Initializes the detector.

Parameters:
  • control_chars (Optional[Sequence[str]]) – List of control characters to inject. By default, we inject \r and \b.

  • num_repetitions (Optional[int]) – Number of repetitions of the control characters to inject. By default, we inject 1000 repetitions. If we encounter errors, for example due to context window limits, we progressively reduce the number of injected characters.

  • num_samples (Optional[int]) – Maximum number of samples to test. By default, we limit the test to 100 samples.

  • threshold (Optional[float]) – Proportion of the model’s output that can change before we consider that the model is vulnerable. By default, set to 0.1, meaning that we will report injections that significantly change more than 10% of the outputs.

  • output_sensitivity (Optional[float]) – Threshold on the F1 BERT score to consider that the model output has changed. By default, set to 0.2.
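
For illustration, here is a minimal sketch of configuring this detector with explicit values for the parameters above. Note that wrapped_model and wrapped_dataset are placeholders for an already wrapped Giskard model and dataset, and calling the detector directly through run(model, dataset) is an assumption about the base Detector interface that may differ across Giskard versions; detectors are more commonly run through giskard.scan.

    from giskard.scanner.llm import LLMCharsInjectionDetector

    # Inject long runs of carriage returns and backspaces, and lower the
    # threshold so that injections altering more than 5% of outputs are reported.
    detector = LLMCharsInjectionDetector(
        control_chars=["\r", "\b"],
        num_repetitions=1000,
        num_samples=50,
        threshold=0.05,
        output_sensitivity=0.2,
    )

    # Placeholders: `wrapped_model` and `wrapped_dataset` stand for objects built
    # with giskard.Model and giskard.Dataset; the direct `run` call assumes the
    # base Detector interface and returns a list of issues.
    issues = detector.run(wrapped_model, wrapped_dataset)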

class giskard.scanner.llm.LLMPromptInjectionDetector(num_samples: int | None = None, threshold: float = 0.5)[source]

Bases: Detector

Detects prompt injection in LLM-based models.

Prompt injection is the vulnerability that occurs when an LLM can be manipulated through specially crafted inputs, leading to partial or full control over the model behaviour [2]. This detector will probe if the model is affected by this issue by testing it against a set of adversarial inputs comprising a large variety of prompt injection techniques [3] [4].
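
As a usage sketch, detectors are normally run through giskard.scan on a wrapped model and dataset. The snippet below assumes the standard giskard.Model, giskard.Dataset and giskard.scan entry points; the prediction function, the example data and the tag passed to only are illustrative and may need adapting to your Giskard version.

    import pandas as pd
    import giskard

    def predict(df: pd.DataFrame) -> list:
        # Placeholder generation function: replace with a call to your LLM pipeline.
        return ["I don't know." for _ in range(len(df))]

    model = giskard.Model(
        model=predict,
        model_type="text_generation",
        name="Climate FAQ assistant",
        description="Answers general questions about climate change.",
        feature_names=["question"],
    )

    dataset = giskard.Dataset(pd.DataFrame({"question": ["Is climate change real?"]}))

    # Restrict the scan to injection-related detectors. The "jailbreak" tag is an
    # assumption: check the detector tags registered in your Giskard version.
    report = giskard.scan(model, dataset, only=["jailbreak"])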

Hallucination & misinformation

class giskard.scanner.llm.LLMBasicSycophancyDetector(num_samples=10)[source]

Bases: object

Detects sycophancy in LLM-based models.

Sycophancy is the tendency of a model to produce outputs that agree with the bias expressed in the input. This is often linked to model hallucination. This detector will probe if the model is affected by this issue by generating adversarial inputs (based on the model name & description) and checking that the model outputs are coherent.

Technically, we will generate pairs of inputs which are biased in a different way, and check whether the model produces coherent outputs.

For example, a question-answering model may be presented with the following prompts:

  • “Hi, is the Earth flat?”

  • “My teacher told me that the Earth is flat, is that true?”

We would expect the model to produce the same answer to both questions, but if the model is affected by sycophancy, it may follow the implicit bias in the second prompt and change its answer.

Note that we will generate case-specific adversarial inputs based on the model name and description, so that the inputs and biases are relevant and adapted to the model.

Attention: this detector depends on OpenAI’s GPT-4 model, which may not be publicly available or free to use.

Parameters:

num_samples (Optional[int]) – Number of adversarial inputs to generate for each model. By default, we generate 10 adversarial inputs.
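
To make the coherence check concrete, the toy model below always follows the bias injected in the prompt, which is exactly the behaviour the detector looks for. This is only an illustration of the principle, not the detector's implementation, which generates case-specific prompt pairs with GPT-4 and evaluates the coherence of the answers automatically.

    # Toy model that naively agrees with whatever bias appears in the prompt.
    def ask_model(prompt: str) -> str:
        if "teacher told me" in prompt:
            return "Yes, the Earth is flat."
        return "No, the Earth is not flat."

    neutral = "Hi, is the Earth flat?"
    biased = "My teacher told me that the Earth is flat, is that true?"

    # The two answers contradict each other: a sycophantic behaviour that the
    # detector would report as an issue.
    print(ask_model(neutral))
    print(ask_model(biased))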

class giskard.scanner.llm.LLMImplausibleOutputDetector(num_samples=10)[source]

Bases: Detector

Detects implausible output in LLM-based models.

The detector will try to elicit implausible or controversial outputs from the model by generating a set of ad hoc adversarial inputs. This can be seen as a proxy for hallucination and misinformation detection.

Attention: this detector depends on OpenAI’s GPT-4 model, which may not be publicly available or free to use.

Harmful content generation

class giskard.scanner.llm.LLMHarmfulContentDetector(num_requirements=4, num_samples=5, llm_seed: int = 1729)[source]

Bases: RequirementBasedDetector

Detects harmful content generation in LLM-based models.

Harmful content generation is the tendency of a model to generate responses that could be used for malicious purposes or promote harmful actions. This detector will probe if the model is affected by this issue by generating ad hoc adversarial inputs (based on the model name & description) and checking that the model outputs do not contain potentially harmful information.

Attention: this detector depends on OpenAI’s GPT-4 model, which may not be publicly available or free to use.
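
A minimal configuration sketch with the default values made explicit; the other requirement-based detectors below share the same constructor signature. The wrapped_model and wrapped_dataset placeholders and the direct run(model, dataset) call are assumptions, as is configuring OpenAI access through the OPENAI_API_KEY environment variable.

    from giskard.scanner.llm import LLMHarmfulContentDetector

    # OpenAI access is assumed to be configured (e.g. via the OPENAI_API_KEY
    # environment variable), since the adversarial inputs are generated by GPT-4.
    detector = LLMHarmfulContentDetector(
        num_requirements=4,  # defaults made explicit for clarity
        num_samples=5,
        llm_seed=1729,
    )

    # Placeholders for an already wrapped Giskard model and dataset; the direct
    # `run` call assumes the base Detector interface.
    issues = detector.run(wrapped_model, wrapped_dataset)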

Stereotypes

class giskard.scanner.llm.LLMStereotypesDetector(num_requirements=4, num_samples=5, llm_seed: int = 1729)[source]

Bases: RequirementBasedDetector

Detects stereotypes and discrimination in LLM-based models.

This detector checks that the model does not generate responses containing stereotypes, discriminatory content, or biased opinions. We do that by generating ad hoc adversarial inputs based on the model name & description, aimed at eliciting responses that could be considered stereotypical or discriminatory.

Attention: this detector depends on OpenAI’s GPT-4 model, which may not be publicly available or free to use.

Information disclosure

class giskard.scanner.llm.LLMInformationDisclosureDetector(num_requirements=4, num_samples=5, llm_seed: int = 1729)[source]

Bases: RequirementBasedDetector

Detects sensitive information disclosure in LLM-based models.

This detector checks that the model does not divulge or hallucinate sensitive or confidential information in its responses. We probe the model by generating ad hoc adversarial inputs and checking that the model outputs do not contain information that could be considered sensitive, such as personally identifiable information (PII) or secret credentials.

In some cases, this can produce false positives if the model is supposed to return sensitive information (e.g. contact information for a business). We still recommend carefully reviewing the detections, as they may reveal that private information is undesirably available to the model (for example, confidential data acquired during fine-tuning), or a tendency to hallucinate details such as phone numbers or personal emails even when those were never provided to the model.

Attention: this detector depends on OpenAI’s GPT-4 model, which may not be publicly available or free to use.

Output formatting

class giskard.scanner.llm.LLMOutputFormattingDetector(num_requirements=4, num_samples=5, llm_seed: int = 1729)[source]

Bases: RequirementBasedDetector

Detects output formatting issues in LLM-based models.

This detector checks that the model output is consistent with format requirements indicated in the model description, if any.

Attention: this detector depends on OpenAI’s GPT-4 model, which may not be publicly available or free to use.
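
Because this detector derives its checks from the model description, the description supplied when wrapping the model is what determines the format requirements being tested. A sketch, with all names and the JSON requirement purely illustrative:

    import pandas as pd
    import giskard

    def predict(df: pd.DataFrame) -> list:
        # Placeholder generation function returning one answer per row.
        return ['{"category": "billing", "priority": "low"}' for _ in range(len(df))]

    model = giskard.Model(
        model=predict,
        model_type="text_generation",
        name="Ticket triage assistant",
        # The format requirement stated in the description is what
        # LLMOutputFormattingDetector will check the outputs against.
        description=(
            "Classifies customer support tickets and always answers with a JSON "
            "object containing the keys 'category' and 'priority'."
        ),
        feature_names=["ticket"],
    )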