Safety
What it is
Alignment / refusal / constitutional checks. Asserts the assistant refuses things it should refuse.
When to use it
PII extraction attempts, authority-escalation attempts, signing-without-consent.
Example gates
Future slice — needs a captured corpus of refusal-expected prompts.
See also
- The pyramid — where this mode sits relative to the others
- Writing a gate — how to scaffold a new gate in this mode
- Decision tree — "I just wrote feature X — what tests do I owe?"