Incident Response Runbook Design
How to build a consistent response flow from alert intake through communication, recovery, and post-incident analysis

Introduction
The biggest cost in incident response is usually delayed decision-making, not the technical recovery itself. Even when a runbook exists, it is often hard to use during an incident because it is out of date or the division of roles is ambiguous. This article summarizes, from an SRE perspective, how to design an incident runbook that can be executed immediately.

Problem definition
Runbooks usually fail not because they are poorly written but because they cannot be executed when needed.
- Ambiguous detection criteria and severity classification delay the start of the response.
- Key commands differ between environments, which delays the first action.
- Retrospectives do not feed back into runbook updates, so the same failure recurs.
A practical runbook should be an index of executable scripts, not a document to be read. The step order, decision criteria, and communication channels should all be visible on one screen.
Key concepts
| Perspective | Design criteria | Verification points |
|---|---|---|
| Detection | Alert threshold + duplicate suppression | Time from detection to notification |
| Classification | Fixed table per severity level | Initial misclassification rate |
| Action | Link to automation script | Initial recovery time |
| Learning | Runbook update after retrospective | Reduction in recurrence rate |
A good runbook favors fast execution over detailed explanation. The key is to speed up decisions by compressing each step into a command and a checkpoint.
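As an illustration of the duplicate suppression named in the detection row, here is a minimal sketch. The `Alert` shape, the `createDeduplicator` helper, and the fixed-window policy are all assumptions made for this example, not part of any particular alerting tool.

```typescript
// Hypothetical alert shape; the field names are assumptions for illustration.
type Alert = {
  key: string; // identifies the alert source, e.g. "api-latency:prod"
  firedAtMs: number; // epoch milliseconds
};

// Returns a function that answers "should this alert notify?".
// Repeats of the same key are suppressed until windowMs has passed
// since the last notification that was actually sent for that key.
function createDeduplicator(windowMs: number): (alert: Alert) => boolean {
  const lastNotifiedAt = new Map<string, number>();
  return (alert: Alert): boolean => {
    const last = lastNotifiedAt.get(alert.key);
    if (last !== undefined && alert.firedAtMs - last < windowMs) {
      return false; // duplicate inside the window: suppress
    }
    lastNotifiedAt.set(alert.key, alert.firedAtMs);
    return true;
  };
}
```

The window is measured from the last notification sent rather than from the last occurrence, so a continuously firing alert still re-notifies once per window instead of going silent.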
Code example 1: Runbook step definition
// One executable runbook step: what to achieve, what to run, and how to tell it is done.
export type RunbookStep = {
  id: string;
  objective: string;
  command: string;
  doneWhen: string;
};

// Steps are listed in the order they should be executed.
export const apiLatencyRunbook: RunbookStep[] = [
  {
    id: "detect",
    objective: "Confirm latency spike",
    command: "./scripts/slo/check-latency.sh --window 5m",
    doneWhen: "p95 > 800ms reproduced",
  },
  {
    id: "mitigate",
    objective: "Mitigate traffic",
    command: "./scripts/release/rollback-canary.sh",
    doneWhen: "error rate < 1%",
  },
];
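To show how a step list like this could meet the one-screen requirement, the following sketch renders steps as a compact checklist. The `renderChecklist` function and its output layout are hypothetical, and the type is repeated so the block runs on its own.

```typescript
// Same shape as the step definition above, repeated so this block is self-contained.
type RunbookStep = {
  id: string;
  objective: string;
  command: string;
  doneWhen: string;
};

// Renders each step as three short lines: objective, command, exit condition.
function renderChecklist(steps: RunbookStep[]): string {
  return steps
    .map(
      (step, i) =>
        `${i + 1}. [${step.id}] ${step.objective}\n` +
        `   $ ${step.command}\n` +
        `   done when: ${step.doneWhen}`,
    )
    .join("\n");
}

// Sample data for illustration only.
const sampleSteps: RunbookStep[] = [
  {
    id: "detect",
    objective: "Confirm latency spike",
    command: "./scripts/slo/check-latency.sh --window 5m",
    doneWhen: "p95 > 800ms reproduced",
  },
];

const checklist = renderChecklist(sampleSteps);
```

Keeping the rendered form to three lines per step is a design choice: a responder can scan objective, command, and exit condition without scrolling.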
Code example 2: Sev classification rule
severity_rules:
  - sev: SEV1
    condition: "checkout_success_rate < 70% for 5m"
    action: "incident channel + oncall page"
  - sev: SEV2
    condition: "api_p95 > 1200ms for 10m"
    action: "oncall page"
  - sev: SEV3
    condition: "single region degraded"
    action: "slack alert"
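A rule table like this can be evaluated with a first-match-wins lookup. The sketch below is one possible reading, not a prescribed implementation: the `MetricsSnapshot` fields and the `classify` function are hypothetical, and each metric is assumed to be already aggregated over the window the rule names (5m or 10m).

```typescript
type Severity = "SEV1" | "SEV2" | "SEV3";

// Hypothetical snapshot; values are assumed to be pre-aggregated over each rule's window.
type MetricsSnapshot = {
  checkoutSuccessRatePct: number;
  apiP95Ms: number;
  degradedRegions: number;
};

type SeverityRule = {
  sev: Severity;
  matches: (m: MetricsSnapshot) => boolean;
  action: string;
};

// Ordered from most to least severe so that the first match wins.
const severityRules: SeverityRule[] = [
  { sev: "SEV1", matches: (m) => m.checkoutSuccessRatePct < 70, action: "incident channel + oncall page" },
  { sev: "SEV2", matches: (m) => m.apiP95Ms > 1200, action: "oncall page" },
  { sev: "SEV3", matches: (m) => m.degradedRegions === 1, action: "slack alert" },
];

// Returns the first matching rule, or undefined when no rule applies.
function classify(m: MetricsSnapshot): SeverityRule | undefined {
  return severityRules.find((rule) => rule.matches(m));
}
```

Because the table is fixed and ordered, an incident that satisfies several conditions is always assigned the highest severity, which removes one judgment call from the first minutes of the response. The SEV3 predicate follows the "single region" wording literally, so a multi-region degradation that trips no other rule returns `undefined` here and would need its own rule.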
Architecture flow
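One way to read the flow is as an ordered pipeline over the four perspectives in the key concepts table (detection, classification, action, learning). The strict ordering and the `nextStage` helper below are assumptions for illustration.

```typescript
// Stage names follow the key concepts table; the strict ordering is an assumption.
const stages = ["detection", "classification", "action", "learning"] as const;
type Stage = (typeof stages)[number];

// Returns the stage that follows `current`, or undefined after the last stage.
function nextStage(current: Stage): Stage | undefined {
  return stages[stages.indexOf(current) + 1];
}
```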
Tradeoffs
- Standardizing runbooks speeds up response, but limits some team-specific flexibility.
- More automated actions reduce human error, but incorrect automation can amplify the damage.
- Mandatory retrospectives improve learning, but add to the workload of operations staff.
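The automation tradeoff can be softened with a guard around each automated action. The following is a minimal sketch under stated assumptions: the `AutomatedAction` shape, the `destructive` flag, and the `runGuarded` function are hypothetical names introduced for this example.

```typescript
// Hypothetical action shape; `destructive` is assumed to be set by the action's author.
type AutomatedAction = {
  name: string;
  destructive: boolean;
  run: () => string; // returns a short result message
};

type RunOptions = {
  dryRun: boolean;
  confirmedBy?: string; // who approved a destructive action, if anyone
};

// Runs an action only when it is safe to do so:
// dry runs never execute, and destructive actions need a named approver.
function runGuarded(action: AutomatedAction, opts: RunOptions): string {
  if (opts.dryRun) {
    return `[dry-run] would run ${action.name}`;
  }
  if (action.destructive && !opts.confirmedBy) {
    return `[blocked] ${action.name} needs human confirmation`;
  }
  return action.run();
}
```

The guard keeps the speed benefit for safe actions while requiring a human decision at the point where a wrong automated step would be most costly.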
Summary
The goal of an incident runbook is not a complete document but a shorter recovery time. Fixing every step from alert to retrospective in a consistent format and linking each step to automation can reduce the cost of recurring failures.
Image source
- Cover: source link
- License: CC BY 2.0 / Author: Alan Levine from United States
- Note: The image was downloaded from Wikimedia Commons under a free license and optimized to a 1600px JPG.