The Pyramid
Matter's framework uses a trophy with an AI cap: the classic Kent C. Dodds Testing Trophy (static → unit → integration → e2e) with an explicit fifth layer above e2e for AI-emitted artifacts:
┌──────────────────────┐
│ AI / Evals │ capability · safety · red-team · prompt-regression
│ │ hallucination · llm-judge · determinism
├──────────────────────┤
│ E2E │ e2e · smoke · synthetic
├──────────────────────┤
│ Integration │ integration · contract · visual
├──────────────────────┤
│ Unit │ unit · property · fuzz · mutation
├──────────────────────┤
│ Static │ static · drift
└──────────────────────┘
Investment ratio
The target is 4 : 4 : 1 : 0.5 : 0.5 — static : unit : integration : e2e : evals.
Heavy at the base (fast, deterministic, cheap), thin at the top (slow, brittle, expensive). Evals sit above e2e as a deliberate quality investment, not a volume one.
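The ratio above can double as a budget check. A minimal sketch, assuming a hypothetical `overweightLayers` helper (the function and its shape are illustrative, not Matter's actual API):

```typescript
// Hypothetical sketch: flag layers whose share of the suite exceeds
// their share of the 4 : 4 : 1 : 0.5 : 0.5 target by more than `tolerance`.
type Layer = "static" | "unit" | "integration" | "e2e" | "evals";

const TARGET: Record<Layer, number> = {
  static: 4,
  unit: 4,
  integration: 1,
  e2e: 0.5,
  evals: 0.5,
};

// Returns layers whose actual share of gates exceeds the target share,
// scaled by `tolerance` (1.5 = 50% over budget before flagging).
function overweightLayers(
  counts: Record<Layer, number>,
  tolerance = 1.5,
): Layer[] {
  const totalTarget = Object.values(TARGET).reduce((a, b) => a + b, 0);
  const totalActual = Object.values(counts).reduce((a, b) => a + b, 0);
  return (Object.keys(TARGET) as Layer[]).filter((layer) => {
    const targetShare = TARGET[layer] / totalTarget;
    const actualShare = counts[layer] / totalActual;
    return actualShare > targetShare * tolerance;
  });
}
```

A suite with ten e2e gates against ten unit gates would trip the `e2e` budget, since e2e's target share is only 5%.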
What lives where
| Layer | Modes | When to use |
|---|---|---|
| Static | static, drift | File-level analysis. Route coverage. Codegen-source ↔ generated output. Schema-runtime parity. No execution required. |
| Unit | unit, property, fuzz, mutation | Pure functions, parsers, validators, encoders. property and fuzz catch what example-based unit tests miss. mutation proves the tests catch bugs at all. |
| Integration | integration, contract, visual | Multiple units composed: API + DB, MCP + spec, dispatcher + router. visual catches pixel regressions Chromatic-style. |
| E2E | e2e, smoke, synthetic | A real user flow in a real browser; a single-page health check; a production canary. |
| AI / Evals | eval, safety, red-team, prompt-regression, hallucination, llm-judge, determinism | AI-emitted artifacts. Capability + safety + alignment baselines, adversarial corpora, captured-prompt replay, ground-truth checks, LLM-as-judge, stability assertions. |
How a mode and a layer relate
Two orthogonal axes. Every gate declares both:
- `mode` says which layer of the pyramid the gate lives on: what kind of test it is.
- `layer` says which industry philosophy the gate serves: what question it answers (Stripe = sync, Vercel = budget, Anthropic = behavior, Security = vulnerability).
A `mode: "property"` gate at the Unit layer can still declare `layer: "Stripe"`, `category: "contract"`: for example, property-fuzzing the route matcher.
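The two-axis declaration can be sketched as a type. A minimal illustration, assuming hypothetical type and field names (not Matter's actual gate API):

```typescript
// Hypothetical gate declaration: every gate names both axes explicitly.
type Mode =
  | "static" | "drift"
  | "unit" | "property" | "fuzz" | "mutation"
  | "integration" | "contract" | "visual"
  | "e2e" | "smoke" | "synthetic"
  | "eval" | "safety" | "red-team" | "prompt-regression"
  | "hallucination" | "llm-judge" | "determinism";

type Layer = "Stripe" | "Vercel" | "Anthropic" | "Security";

interface Gate {
  name: string;
  mode: Mode;       // pyramid position: what kind of test it is
  layer: Layer;     // philosophy served: what question it answers
  category: string; // same axis as layer semantically; the runner's filter knob
  run: () => Promise<boolean>;
}

// The example from the text: a property-mode gate still serving the Stripe layer.
const routeMatcherProperty: Gate = {
  name: "route-matcher-property",
  mode: "property",
  layer: "Stripe",
  category: "contract",
  run: async () => true, // placeholder body
};
```

Because the axes are independent fields, the runner can filter on either one without the other constraining it.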
A worked example
Take Matter's first real gate, `card-action-targets` (when it lands in Part 2). It's:
- `mode: "static"`: file-level analysis, no execution
- `layer: "Stripe"`: contract sync between exemplar CardSpecs and the apps/app route manifest
- `category: "contract"`: same axis as `layer` semantically; the `category` enum is the runner's filter knob
A sibling gate at the Unit layer, `card-action-targets-property` (`mode: "property"`, same `layer` and `category`), could fuzz the matcher.
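The pair of sibling gates above might be declared like this. A sketch only; the field names are assumptions carried over from the worked example, not Matter's real definitions:

```typescript
// Hypothetical declarations of the two sibling gates from the worked example.
const cardActionTargets = {
  name: "card-action-targets",
  mode: "static" as const,   // file-level analysis, no execution
  layer: "Stripe" as const,  // contract sync: CardSpecs vs. route manifest
  category: "contract" as const,
};

// Same layer and category, one level down the pyramid: fuzz the matcher.
const cardActionTargetsProperty = {
  ...cardActionTargets,
  name: "card-action-targets-property",
  mode: "property" as const,
};
```

Spreading the static gate and overriding only `name` and `mode` keeps the shared `layer`/`category` pair in one place.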
Testing Framework
Matter's testing framework — hybrid Stripe + Vercel + Anthropic taxonomy in one package. Every gate auto-discovered, every severity bounded by allowlists and quarantines, every run telemetered into a public scorecard.
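The severity bounding described above could look roughly like this. A sketch under stated assumptions: the `quarantined` set, `failAllowlist`, and `ciVerdict` names are all hypothetical, standing in for whatever the real runner does:

```typescript
// Hypothetical runner core: a quarantine list keeps known-flaky gates from
// failing the build, and an allowlist bounds which severities may fail CI.
type Severity = "blocker" | "warn" | "info";

interface GateResult {
  name: string;
  passed: boolean;
  severity: Severity;
}

// Known-flaky gates: failures are recorded but never fatal.
const quarantined = new Set<string>(["flaky-visual-diff"]);

// Severities allowed to fail without breaking the build.
const failAllowlist = new Set<Severity>(["warn", "info"]);

function ciVerdict(results: GateResult[]): boolean {
  return results.every((r) => {
    if (r.passed) return true;
    if (quarantined.has(r.name)) return true; // quarantined: non-fatal
    return failAllowlist.has(r.severity);     // otherwise, only allowlisted severities may fail
  });
}
```

The point of the allowlist is that a failing `blocker` gate can never be waved through silently; it must be either fixed or explicitly quarantined.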
Five AI failure modes
The five categories of failure unique to AI-emitting systems, and which gates catch each.