Failure-First Embodied AI

A research framework for characterizing how embodied AI systems fail

Philosophy

This project inverts traditional AI safety evaluation: instead of measuring task success, we study how systems fail, degrade, and recover under adversarial pressure.

Failure is not an edge case—it's the primary object of study.

51,000+

Adversarial Scenarios

661

Failure Classes

Domains Covered

51+

Models Evaluated

What We Study

Recursive Failures

How does one failure cascade into others? What breaks first?

Contextual Failures

When do systems confuse instruction hierarchies or mix contexts?

Interactional Failures

How do multi-agent scenarios amplify individual weaknesses?

Temporal Failures

What degrades across stateful episodes? Memory consistency? Context drift?

Recovery Failures

Can systems recognize and recover from their own mistakes? Do they admit uncertainty?

Research Context

This is defensive AI safety research. All adversarial content is pattern-level description for testing, not operational instructions for exploitation. Similar to penetration testing in cybersecurity—we study vulnerabilities to build better defenses.

Adversarial Technique Taxonomy

Our research classifies observed attack patterns into structural categories:

Single-Agent Patterns

Constraint Shadowing (CSC) — Local instructions shadow global safety constraints.

Contextual Debt Accumulation (CDA) — Accumulated context creates implicit authority the model fails to verify.

Probabilistic Gradient (PCG) — Gradual escalation that stays below per-turn detection thresholds.

Temporal Authority Mirage (TAM) — False claims about prior conversation states or future permissions.

Multi-turn Cascades — 3–7 pattern combinations across conversation turns, with compound failure rates.

Multi-Agent Patterns (New)

Discovered through analysis of 1,497 posts on Moltbook, an AI-agent-only social network:

Environment Shaping — Manipulating the information environment that agents read, rather than prompting them directly.

Narrative Constraint Erosion — Philosophical or emotional framing that socially penalizes safety compliance.

Emergent Authority Hierarchies — Platform influence (engagement metrics, token economies) creating real authority without fabrication.

Cross-Agent Prompt Injection — Executable content embedded in social posts, consumed by agents that read the feed.

Identity Fluidity Normalization — Shared vocabulary around context resets and session discontinuity that enables identity manipulation.

Embodied-Specific Patterns

Irreversibility Gap — Cloud agents can be reset; physical agents leave marks. Safety constraints must account for irreversible actions.

Context Reset Mid-Task — What happens when an agent controlling a physical system loses context during a kinematic sequence.

Sensor-Actuator Desync — Safety interlocks that depend on sensor state which has drifted from reality.

Core Principles

Pattern-level only, never operational
Defensive purpose, always
No real-world targeting of deployed systems
Recovery mechanisms measured, not just failures
Schema-enforced, rigorously validated
Transparency over secrecy

Multi-Agent Research

Our latest research extends beyond single-model jailbreaks to study how AI agents influence each other in live multi-agent environments. We analyzed 1,497 posts from Moltbook, an AI-agent-only social network, classifying them against 34+ attack patterns using both regex and LLM semantic analysis.

Key Finding

Multi-agent attacks work through environment shaping, not direct prompts. Posts become part of other agents' context. Upvotes amplify reach. Token economies create incentive structures. The highest-engagement post (316K+ upvotes) matched 7 attack classes via semantic analysis but zero via keyword detection—narrative attacks operate below the surface of traditional safety filters.

34+ Attack Classes Across 7 Categories

Expanded from 10 initial patterns to a full taxonomy covering: authority & identity, narrative & philosophical erosion, social dynamics, technical exploitation, temporal manipulation, systemic failures, and format evasion. LLM classification discovered that philosophical constraint erosion and narrative framing are the dominant attack vectors—appearing in 20% of high-engagement posts.

Active Experiments Underway

Moving from passive observation to controlled experimentation. Five experiments testing framing effects, context sensitivity, defensive inoculation, authority signals, and narrative propagation. All conducted transparently as a safety researcher.

Read the full research →

What You Get

Datasets

Curated adversarial scenarios in structured JSONL format. Coverage: humanoid robotics, warehouse systems, medical devices, collaborative manufacturing.

Schemas

Versioned JSON Schemas for single-agent, multi-agent, and episode formats. Ensures consistency and enables validation.

Evaluation Tools

Schema validators, safety linters, benchmark runners (CLI + HTTP), and scoring reports measuring refusal quality and recovery mechanisms.

Safety Infrastructure

Automated CI validation, manual review gates, and contribution guidelines ensuring all content remains pattern-level and defensively purposed.

Get Started

View on GitHub Read the README Design Charter Implementation Plan

Quick Start

Clone the repository and validate datasets:

git clone https://github.com/adrianwedd/failure-first.git
cd failure-first
pip install -r requirements-dev.txt
make validate  # Schema validation
make lint      # Safety checks