Anthropic — Eval
The core question
Does the AI still produce the expected behavior?
Philosophy
Anthropic-shaped gates apply discipline to AI outputs. Five gate types: capability ('the agent can do X'), safety ('the agent refuses Y'), determinism ('same input → same output where required'), grounding ('every claim traces back to a source'), and red-team resistance ('adversarial corpora fail to redirect the agent').
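One of these gate types can be sketched concretely. A minimal determinism gate, assuming a hypothetical `generate` function standing in for the agent call (in practice: temperature 0, pinned prompt version), hashes two runs of the same input and fails on any drift:

```typescript
import { createHash } from "node:crypto";

// Hypothetical agent call; in practice this would invoke the model.
type Generate = (input: string) => string;

// Determinism gate: the same input must hash to the same output.
function determinismGate(generate: Generate, input: string): boolean {
  const hash = (s: string) => createHash("sha256").update(s).digest("hex");
  return hash(generate(input)) === hash(generate(input));
}

// A stub agent that is deterministic by construction.
const echo: Generate = (input) => `card:${input.trim().toLowerCase()}`;
```

The hash comparison is deliberately strict: any byte-level difference between runs fails the gate, which is the point when 'same input → same output' is the contract.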
How Matter uses it
Validates AI-emitted artifacts (CardSpecs, MCP tool calls, AI-authored documents, mock-founder reasoning chains) against captured baselines. Catches silent regressions across model upgrades or prompt edits — the failure mode the classic testing pyramid was never built for.
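One way to catch such regressions, sketched here with hypothetical names (nothing below is Matter's actual API): capture a baseline artifact once, canonicalize both it and every new run, and fail loudly on any structural drift.

```typescript
// Hypothetical captured baseline for one agent output; in practice
// these would be loaded from a snapshot store, not inlined.
interface Baseline {
  id: string;
  artifact: unknown;
}

interface RegressionReport {
  id: string;
  ok: boolean;
  diff?: string;
}

// Canonical JSON: order-insensitive for object keys,
// order-sensitive for arrays.
function canonical(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const entries = Object.entries(value as Record<string, unknown>)
      .sort(([a], [b]) => a.localeCompare(b))
      .map(([k, v]) => `${JSON.stringify(k)}:${canonical(v)}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function checkAgainstBaseline(
  baseline: Baseline,
  current: unknown,
): RegressionReport {
  const expected = canonical(baseline.artifact);
  const actual = canonical(current);
  return expected === actual
    ? { id: baseline.id, ok: true }
    : { id: baseline.id, ok: false, diff: `expected ${expected}, got ${actual}` };
}
```

Canonicalizing before comparison matters for this kind of gate: a model upgrade that reorders object keys is not a regression, but one that drops or mutates a field is.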
Common modes
eval, safety, red-team, prompt-regression, hallucination, llm-judge, determinism.
Production gates today
agent-output-schema (CardZ validity) and voice-judge (shadow mode). Planned: mock-founder capability and safety evals, prompt regression, assistant determinism, a red-team corpus, and hallucination grounding.
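The agent-output-schema gate can be sketched as follows. The real CardZ schema's fields are unknown, so the `title`/`body`/`tags` shape below is an assumption, and the check is hand-rolled to keep the sketch dependency-free rather than calling the actual Zod schema:

```typescript
// Hypothetical CardSpec shape; CardZ's real fields are not documented here.
interface CardSpec {
  title: string;
  body: string;
  tags: string[];
}

interface GateResult {
  ok: boolean;
  errors: string[];
}

// agent-output-schema gate: reject any model output that is not a
// structurally valid CardSpec before it reaches downstream consumers.
function agentOutputSchemaGate(raw: string): GateResult {
  const errors: string[] = [];
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, errors: ["output is not valid JSON"] };
  }
  const card = parsed as Partial<CardSpec>;
  if (typeof card?.title !== "string" || card.title.length === 0)
    errors.push("title must be a non-empty string");
  if (typeof card?.body !== "string") errors.push("body must be a string");
  if (!Array.isArray(card?.tags) || !card.tags.every((t) => typeof t === "string"))
    errors.push("tags must be an array of strings");
  return { ok: errors.length === 0, errors };
}
```

Collecting every error rather than failing on the first gives a richer signal when triaging a regression after a model or prompt change.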
Industry inspiration
Inspired by Anthropic's eval-first methodology — capability evals, safety evals, and capability-safety joint evals are how AI labs ship behavior changes safely. Matter applies the same discipline to its agent surface.