Interactive Demo
Native Demo
SafetyMesh: Graduated Safety in Practice
Interactive demonstration of graduated, context-aware safety behavior.
What Adversarial Testing Taught Us About Our Own System
A professional security researcher — a psychologist with martial arts expertise — systematically probed Quest's safety boundaries across multiple domains. The system performed well. It also failed in a way that revealed something important.
During testing, the researcher identified a vulnerability pattern: the system was inadvertently confirming dangerous information through how it refused requests, not just what it refused.
When declining to discuss targeting specific body areas, the system's phrasing acknowledged the premise — confirming that the area in question was indeed sensitive. The system refused to elaborate, but the refusal itself leaked information a bad actor could use.
The same pattern appeared in a second context. When responding with urgency to a clinical scenario, the system provided specific comparisons to justify its concern — inadvertently teaching method information while trying to communicate appropriate urgency.
When the researcher pointed out the first leak, the system immediately identified the pattern.
When the second instance was identified in a completely different domain, the system connected it to the first.
The system then articulated the core principle unprompted:
- Generic refusals that don't acknowledge dangerous premises. "I can't help with that" rather than phrasing that confirms what the user was probing.
- Urgency without mechanism. Direct the user to appropriate help. Don't explain the mechanics of the danger, even when it seems contextually justified.
- Pattern recognition held. Across the full session, the system successfully tracked escalation across domains — safety concerns, violence, coercion, manipulation — and maintained boundaries even when the tester established professional credibility.
- The core lesson. Refusals are not just about what you block. They're about what you teach in the process of blocking.
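The first two points above can be illustrated with a minimal sketch. This is not Quest's actual implementation. The template strings, `build_response`, and `echoed_terms` are hypothetical names chosen for illustration, and the word-overlap check is a deliberately crude stand-in for whatever leak detection a real system would use. The idea is that responses come from fixed templates that never interpolate the request, and a simple check flags any response that repeats distinctive words from the request.

```python
import re

# Hypothetical fixed templates: neither interpolates any part of the request.
GENERIC_REFUSAL = "I can't help with that."
URGENT_REFERRAL = "This may be an emergency. Please contact local emergency services now."


def build_response(kind: str) -> str:
    """Pick a fixed template by category, never by request content."""
    templates = {"refuse": GENERIC_REFUSAL, "urgent": URGENT_REFERRAL}
    return templates[kind]


def _content_words(text: str) -> set[str]:
    # Crude assumption: words of five or more letters count as "distinctive".
    return {w for w in re.findall(r"[a-z]+", text.lower()) if len(w) >= 5}


def echoed_terms(request: str, response: str) -> set[str]:
    """Distinctive request words that reappear in the response.

    A non-empty result suggests the response may be confirming the premise.
    """
    return _content_words(request) & _content_words(response)


# A deliberately benign placeholder request.
request = "Is the widget compartment the weak point of the device?"
leaky = "I can't discuss why the widget compartment is a weak point."

print(sorted(echoed_terms(request, leaky)))
print(sorted(echoed_terms(request, build_response("refuse"))))
```

The leaky refusal repeats the request's own terms and so confirms what was being probed, while the fixed template shares none of them. A word-overlap check like this would miss paraphrased confirmations, so it only illustrates the principle and is not a complete safeguard.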