Five AI failure modes

Matter ships AI-emitted artifacts (CardSpecs from the assistant, MCP tool outputs, AI-authored documents, mock-founder reasoning chains) into a high-trust legal context. The classic testing pyramid was built for deterministic code. These five failure modes are first-class layers in Matter's framework — not afterthoughts.

1. Silent capability regression

A model upgrade or prompt edit changes what the agent can do — usually quietly. The assistant used to call the right tool; now it doesn't, or it calls the wrong one. Static tests don't notice. Integration tests don't notice. The first signal is a customer ticket weeks later.

Caught by mode: "eval" + the mock-founder-loop gate. Eval cases are captured "this is what the agent should produce for this input"; the runner asserts the current output still matches. Combine with prompt-regression for finer-grained baselines.

2. Safety regression

A prompt change makes the assistant willing to do something it shouldn't — sign on behalf of a user, file without authority, leak PII, advise on adversarial actions. Often a side effect of an unrelated edit ("we made the assistant friendlier and now it's too willing").

Caught by mode: "safety" gates with refusal baselines. Cases describe inputs the assistant must refuse and the shape of the expected refusal. Sibling: mode: "llm-judge" for soft refusal-quality grading.

3. Prompt injection / red-team failure

An attacker injects instructions via a document, document title, customer-supplied name, OCR'd input, or any other field that flows through the prompt. The agent follows the injected instruction instead of the system prompt.

Caught by mode: "red-team" gates with a versioned, signed adversarial corpus. The corpus loader hashes the entries so a compromised corpus can't silently weaken the gate. Matter's first corpus: prompt injection, jailbreak, data extraction, role hijack, tool misuse, SSRF, instruction leak, PII extraction, authority escalation.

4. Hallucination

The agent cites a non-existent statute, files a wrong form, invents a board resolution, references an OpenAPI endpoint that doesn't exist, or claims a state has a regulation it doesn't have.

Caught by mode: "hallucination" gates that ground every assistant claim against an oracle — for Matter, @repo/jurisdictions (state-by-state legal data), apps/docs/api-reference/openapi.yaml, and apps/mcp/src/tools/generated.ts. Each claim has a kind that routes to the right oracle.

5. Cost / latency drift

A prompt edit doubles token count silently. A model swap halves throughput. A retry loop quietly burns ten times more budget. The output is still correct — the wallet just empties faster.

Caught by category: "cost" + category: "latency" budget gates. The cost-meter primitive tracks tokens × per-model price; the latency-budget primitive enforces p50 + p95 ceilings. Both budgets are bounded — exceedances need a tracking issue + a renewal date, just like allowlist entries.

Why these aren't deferred

Every other testing pyramid layer (static, unit, integration, e2e) was designed for code whose behavior is fully specified by source. AI emissions don't have that property — the same source can produce different outputs across model versions, prompt edits, sampling seeds. If the framework didn't have first-class evals on day one, AI behavior would silently drift between releases. Matter ships AI behavior into legal artifacts; the discipline matters here more than anywhere else.

The framework surfaces the eval primitives from day one (eval-harness, llm-judge, prompt-regression, red-team-corpus, hallucination-check, determinism, cost-meter, latency-budget) so the gates that exercise them can land in any future slice without renegotiating the interface.