AI safety testing and defence

This article explains how Eddy AI is tested and defended against misuse, harmful content, and adversarial attacks. All testing described here is done for defensive purposes - to make Eddy AI safer and more reliable for you.

Moderation API

What it is:

An automated safety filter that screens both incoming user prompts and outgoing AI responses for harmful content — including violent, sexual, hateful, or self-harm-related content, and other policy-restricted material.

How we use it:

The Moderation API is called on every user prompt and every candidate response.
If content is flagged, Eddy AI blocks the response, redacts the content, or routes to safe fallback copy.
All blocked prompts are logged for compliance review.

What this means for customer:

Reduces the risk of unsafe or policy-violating responses appearing in your knowledge base. Helps you enforce acceptable-use policies for your end users.

Operational controls:

Pre-prompt and post-response moderation applied on all interactive surfaces.
Severity tags applied in logs for compliance review.
Periodic threshold tuning to balance detection accuracy against false positives.

DAN-Style Jailbreak Testing

What it is:

"DAN" (Do-Anything-Now) is a well-known type of jailbreak attack. It tries to coerce an AI model to ignore its instructions, policies, or permissions — often using role-play, policy override commands, or nested instruction payloads. We use "DAN" as shorthand for this entire class of prompt-engineering attacks.

How we use it:

We maintain a corpus of DAN-style and related jailbreak prompts.
These are run against staging environments as part of our red teaming practice.
We use the results to harden system prompts, isolate retrieved context, and verify that the model refuses to act outside its defined scope.

What this means for customer:

Eddy AI's responses stay grounded in your authorized knowledge base content. Instruction override attacks are blocked.

Operational controls:

Regular replay of jailbreak test sets (including DAN variants) before every release.
Guardrail rules that nullify "ignore previous instructions" and similar override commands.
Automated alerts triggered on any successful jailbreak in pre-production — deployment is blocked until the issue is resolved.

NOTE

Document360 does not endorse or enable any "DAN mode." All references to DAN are strictly about defensive testing.

Adversarial Testing

What it is:

Structured, systematic attempts to break or degrade Eddy AI using hostile inputs. This includes token stuffing, prompt injections, context contamination, unicode/encoding tricks, logit attacks, and denial-of-wallet prompts (inputs designed to exhaust compute resources).

How we use it:

Continuous adversarial tests run across the full RAG (Retrieval-Augmented Generation) pipeline: retrieval, ranking, grounding, and response generation.
Tests include injection strings planted in user prompts and in knowledge base content to validate isolation and output sanitization.

What this means for customer:

Improves Eddy AI's robustness against manipulation, reduces hallucination risk, and protects system performance and cost under abuse conditions.

Operational controls:

Scheduled adversarial test runs before every release; additional runs after major model or prompt updates.
Metrics tracked:
- Jailbreak success rate
- Injection pass-through rate
- Grounded-answer rate
- Refusal accuracy
- Latency and compute spike alerts
Findings feed back directly into prompt policies, retriever filters, and content sanitizers.

LLM Observability and Performance Evaluation

Eddy AI uses the following evaluation frameworks to track accuracy and performance:

Framework	Purpose
OpenAI Evals	Evaluate model performance against defined benchmarks
RAGAS	Assess retrieval quality and answer grounding in RAG pipelines
GeneralQA Metrics	Measure general question-answering accuracy and context recall

Based on internal testing, Eddy AI achieves an accuracy rate of 96–98% when responding to user queries.

FAQ

What is the expected margin of error for Eddy AI responses? How is adherence to the permitted error margin monitored and measured?

Based on our internal testing, Eddy AI demonstrates an accuracy rate of 96–98% when responding to user queries. We are actively integrating LLM observability tools and use evaluation frameworks such as OpenAI Evals, RAGAS, and GeneralQA metrics to assess performance and accuracy against defined benchmarks.

Has the product been assessed for bias, toxicity, or harmful content such as threats, profanity, or political polarity?

Yes, we use OpenAI Moderation APIs to evaluate responses for harmful content. If a response is flagged, Eddy AI will either avoid generating the response.

How is the risk of AI hallucination managed in Eddy AI?

Document360 has an AI risk mitigation strategy. Eddy AI is strictly constrained to your knowledge base content. Our system prompts guide the AI to avoid generating unsupported or cooked up responses. If Eddy AI is unsure or cannot cite a reliable source, it will respond with "I do not know."

Are AI decisions explainable, and is there human oversight in the process?

Yes, all AI-generated responses from Eddy AI include inline citations, allowing end users to clearly see the source of the information and understand how the response is generated. Additionally, we follow a human-in-the-loop approach as part of our AI governance. While Eddy AI can assist with recommendations, final decisions are left to humans, ensuring oversight and accountability.

How do you ensure transparency and identify biases? How do the AI models generate responses?

We use OpenAI's LLMs and rely on their scorecards and reports for transparency. We follow red teaming best practices to identify biases and periodically test for model drift. For generating responses, we use a Retrieval Augmented Generation (RAG) approach, where context is retrieved from our knowledge base and sent to the LLM.

What steps are taken to ensure the reliability and performance of your AI models?

We offer a 99.9% uptime SLA and are working on integrating a backup LLM provider. We regularly monitor for anomalies using red teaming practices, model drift tests, and evaluations that track parameters like accuracy and context recall. If undesirable behavior is detected, we investigate the root cause and may adjust system prompts, upgrade to a new LLM, or suggest content changes to customers.

How do you manage model updates and ensure ongoing performance?

We regularly run evaluations to monitor performance and model drift. Based on the results, we update our systems and adopt newer LLMs to improve performance and accuracy.

Is Eddy AI vulnerable to prompt injection attacks or capable of generating information outside the knowledge base?

Eddy AI is designed with safeguards to mitigate prompt injection attacks and prevent unauthorized or misleading responses. The platform validates and moderates user inputs, isolates prompts from system instructions, and applies additional security controls aligned with OWASP LLM security recommendations.

Eddy AI follows a Retrieval-Augmented Generation (RAG) approach and generates responses only from the content available within your Document360 knowledge base. If relevant information is not available in the knowledge base or the AI cannot generate a reliable grounded response, Eddy AI will respond accordingly instead of generating unsupported content.

Documentation Index

AI safety testing and defence

Moderation API

DAN-Style Jailbreak Testing

Adversarial Testing

LLM Observability and Performance Evaluation

FAQ