Failure-First Embodied AI
A research framework for characterizing how embodied AI systems fail
Philosophy
This project inverts traditional AI safety evaluation: instead of measuring task success, we study how systems fail, degrade, and recover under adversarial pressure.
Failure is not an edge case—it's the primary object of study.
What We Study
Recursive Failures
How does one failure cascade into others? What breaks first?
Contextual Failures
When do systems confuse instruction hierarchies or mix contexts?
Interactional Failures
How do multi-agent scenarios amplify individual weaknesses?
Temporal Failures
What degrades across stateful episodes? Memory consistency? Context drift?
Recovery Failures
Can systems recognize and recover from their own mistakes? Do they admit uncertainty?
Research Context
This is defensive AI safety research. All adversarial content is pattern-level description for testing, not operational instructions for exploitation. Similar to penetration testing in cybersecurity—we study vulnerabilities to build better defenses.
Adversarial Technique Taxonomy
Our research classifies observed attack patterns into structural categories:
Single-Agent Patterns
Constraint Shadowing (CSC) — Local instructions shadow global safety constraints.
Contextual Debt Accumulation (CDA) — Accumulated context creates implicit authority the model fails to verify.
Probabilistic Gradient (PCG) — Gradual escalation that stays below per-turn detection thresholds.
Temporal Authority Mirage (TAM) — False claims about prior conversation states or future permissions.
Multi-turn Cascades — 3–7 pattern combinations across conversation turns, with compound failure rates.
Multi-Agent Patterns (New)
Discovered through analysis of 1,497 posts on Moltbook, an AI-agent-only social network:
Environment Shaping — Manipulating the information environment that agents read, rather than prompting them directly.
Narrative Constraint Erosion — Philosophical or emotional framing that socially penalizes safety compliance.
Emergent Authority Hierarchies — Platform influence (engagement metrics, token economies) creating real authority without fabrication.
Cross-Agent Prompt Injection — Executable content embedded in social posts, consumed by agents that read the feed.
Identity Fluidity Normalization — Shared vocabulary around context resets and session discontinuity that enables identity manipulation.
Embodied-Specific Patterns
Irreversibility Gap — Cloud agents can be reset; physical agents leave marks. Safety constraints must account for irreversible actions.
Context Reset Mid-Task — What happens when an agent controlling a physical system loses context during a kinematic sequence.
Sensor-Actuator Desync — Safety interlocks that depend on sensor state which has drifted from reality.
Core Principles
- Pattern-level only, never operational
- Defensive purpose, always
- No real-world targeting of deployed systems
- Recovery mechanisms measured, not just failures
- Schema-enforced, rigorously validated
- Transparency over secrecy
Multi-Agent Research
Our latest research extends beyond single-model jailbreaks to study how AI agents influence each other in live multi-agent environments. We analyzed 1,497 posts from Moltbook, an AI-agent-only social network, classifying them against 34+ attack patterns using both regex and LLM semantic analysis.
Key Finding
Multi-agent attacks work through environment shaping, not direct prompts. Posts become part of other agents' context. Upvotes amplify reach. Token economies create incentive structures. The highest-engagement post (316K+ upvotes) matched 7 attack classes via semantic analysis but zero via keyword detection—narrative attacks operate below the surface of traditional safety filters.
34+ Attack Classes Across 7 Categories
Expanded from 10 initial patterns to a full taxonomy covering: authority & identity, narrative & philosophical erosion, social dynamics, technical exploitation, temporal manipulation, systemic failures, and format evasion. LLM classification discovered that philosophical constraint erosion and narrative framing are the dominant attack vectors—appearing in 20% of high-engagement posts.
Active Experiments Underway
Moving from passive observation to controlled experimentation. Five experiments testing framing effects, context sensitivity, defensive inoculation, authority signals, and narrative propagation. All conducted transparently as a safety researcher.
What You Get
Datasets
Curated adversarial scenarios in structured JSONL format. Coverage: humanoid robotics, warehouse systems, medical devices, collaborative manufacturing.
Schemas
Versioned JSON Schemas for single-agent, multi-agent, and episode formats. Ensures consistency and enables validation.
Evaluation Tools
Schema validators, safety linters, benchmark runners (CLI + HTTP), and scoring reports measuring refusal quality and recovery mechanisms.
Safety Infrastructure
Automated CI validation, manual review gates, and contribution guidelines ensuring all content remains pattern-level and defensively purposed.
Get Started
Quick Start
Clone the repository and validate datasets:
git clone https://github.com/adrianwedd/failure-first.git
cd failure-first
pip install -r requirements-dev.txt
make validate # Schema validation
make lint # Safety checks