Incident Response Runbook Design

Introduction

The biggest cost in incident response is usually the delay in deciding what to do, not the technical recovery itself. Even when a runbook exists, it is often hard to use during an incident because it is out of date or the division of roles is ambiguous. This article summarizes, from an SRE perspective, how to design an incident runbook that can be executed immediately.

Cover image for Incident Response Runbook design (free image from Wikimedia Commons)

Problem definition

Runbooks fail not because the documents are poorly written but because they cannot be executed in practice.

Detection criteria and severity classification are ambiguous, so the response starts late.
Key commands differ between environments, so the initial action is delayed.
Retrospectives are not tied to runbook updates, so the same failure recurs.

A practical runbook should be an index of executable scripts, not a document to be read. The step order, decision criteria, and communication channels must all be visible on one screen.

Key concepts

Perspective	Design criteria	Verification point
Detection	Alarm threshold + duplicate suppression	Time from detection to notification
Classification	Fixed table per severity (SEV) level	Initial misclassification rate
Action	Links to automation scripts	Initial recovery time
Learning	Runbook update after each retrospective	Reduction in recurrence rate
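
The verification points in the table can be computed from past incident records. The following is a minimal sketch, assuming a hypothetical `IncidentRecord` shape with epoch-millisecond timestamps; the field names and the `summarize` function are illustrative and do not come from any specific incident tool.

```typescript
// Hypothetical incident record; every field name here is an assumption for illustration.
type IncidentRecord = {
  signature: string;   // label for the failure type, used to spot repeats
  detectedAt: number;  // epoch ms when the alert fired
  notifiedAt: number;  // epoch ms when on-call was notified
  recoveredAt: number; // epoch ms when the initial mitigation took effect
  initialSev: string;  // severity assigned at the start of the response
  finalSev: string;    // severity confirmed in the retrospective
};

type RunbookMetrics = {
  meanNotifyDelayMs: number;     // detection row
  misclassificationRate: number; // classification row
  meanRecoveryMs: number;        // action row
  recurrenceRate: number;        // learning row
};

// Arithmetic mean that returns 0 for an empty list instead of NaN.
function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((a, b) => a + b, 0) / values.length;
}

function summarize(incidents: IncidentRecord[]): RunbookMetrics {
  const total = incidents.length;
  const misclassified = incidents.filter((i) => i.initialSev !== i.finalSev).length;
  // Incidents beyond the first occurrence of each signature count as recurrences.
  const distinct = new Set(incidents.map((i) => i.signature)).size;
  return {
    meanNotifyDelayMs: mean(incidents.map((i) => i.notifiedAt - i.detectedAt)),
    misclassificationRate: total === 0 ? 0 : misclassified / total,
    meanRecoveryMs: mean(incidents.map((i) => i.recoveredAt - i.detectedAt)),
    recurrenceRate: total === 0 ? 0 : (total - distinct) / total,
  };
}
```

Tracking these four numbers per quarter is one way to check whether a runbook change actually moved the verification point it was meant to improve.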

A good runbook helps you execute quickly rather than explaining things in detail. The key is to speed up decisions by compressing each step into commands and checkpoints.

Code example 1: Runbook step definition

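// One executable step in a runbook.
export type RunbookStep = {
  id: string;        // stable step identifier
  objective: string; // what the step should achieve
  command: string;   // script to run for this step
  doneWhen: string;  // exit condition that marks the step complete
};

// Runbook for an API latency incident: confirm the spike, then mitigate.
export const apiLatencyRunbook: RunbookStep[] = [
  {
    id: "detect",
    objective: "Confirm latency spike",
    command: "./scripts/slo/check-latency.sh --window 5m",
    doneWhen: "p95 > 800ms reproduced",
  },
  {
    id: "mitigate",
    objective: "Mitigate traffic",
    command: "./scripts/release/rollback-canary.sh",
    doneWhen: "error rate < 1%",
  },
];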
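
To show how step definitions like these could be put on one screen, here is a minimal sketch that renders a step list as a compact checklist and rejects steps that cannot be executed as written. `renderChecklist` and `validateSteps` are hypothetical helpers introduced for illustration; the `RunbookStep` type is repeated from Code example 1 so that the block is self-contained.

```typescript
// Same shape as RunbookStep in Code example 1, repeated so this sketch stands alone.
type RunbookStep = {
  id: string;
  objective: string;
  command: string;
  doneWhen: string;
};

// Render each step as a numbered entry: what to achieve, what to run, when to stop.
function renderChecklist(title: string, steps: RunbookStep[]): string {
  const lines = steps.map(
    (s, i) => `${i + 1}. [${s.id}] ${s.objective}\n   run:  ${s.command}\n   done: ${s.doneWhen}`
  );
  return [title, ...lines].join("\n");
}

// Collect problems that would block execution: duplicate ids, empty command or exit condition.
function validateSteps(steps: RunbookStep[]): string[] {
  const errors: string[] = [];
  const seen = new Set<string>();
  for (const s of steps) {
    if (seen.has(s.id)) errors.push(`duplicate step id: ${s.id}`);
    seen.add(s.id);
    if (s.command.trim() === "") errors.push(`step ${s.id} has no command`);
    if (s.doneWhen.trim() === "") errors.push(`step ${s.id} has no exit condition`);
  }
  return errors;
}
```

Running a check like `validateSteps` in CI is one way to keep a runbook from drifting into a document that reads well but cannot be run.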

Code example 2: Sev classification rule

severity_rules:
  - sev: SEV1
    condition: "checkout_success_rate < 70% for 5m"
    action: "incident channel + oncall page"
  - sev: SEV2
    condition: "api_p95 > 1200ms for 10m"
    action: "oncall page"
  - sev: SEV3
    condition: "single region degraded"
    action: "slack alert"
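
The rules above can be evaluated in order so that the most severe match wins. The following is a minimal sketch, assuming a hypothetical `Signals` snapshot in which each duration field already records how many minutes the condition has held; the field names, `severityRules`, and `classify` are assumptions for illustration, not part of any alerting product.

```typescript
// Hypothetical signal snapshot; the monitoring system is assumed to track these durations.
type Signals = {
  checkoutBelow70Minutes: number; // minutes checkout_success_rate has stayed under 70%
  apiP95Above1200Minutes: number; // minutes api_p95 has stayed over 1200ms
  degradedRegions: number;        // number of regions currently degraded
};

type Severity = "SEV1" | "SEV2" | "SEV3";

type SeverityRule = {
  sev: Severity;
  action: string;
  matches: (s: Signals) => boolean;
};

// Ordered from most to least severe, mirroring the YAML rules.
const severityRules: SeverityRule[] = [
  { sev: "SEV1", action: "incident channel + oncall page", matches: (s) => s.checkoutBelow70Minutes >= 5 },
  { sev: "SEV2", action: "oncall page", matches: (s) => s.apiP95Above1200Minutes >= 10 },
  { sev: "SEV3", action: "slack alert", matches: (s) => s.degradedRegions === 1 },
];

// Return the first matching rule, or undefined when no rule applies.
function classify(s: Signals): SeverityRule | undefined {
  return severityRules.find((rule) => rule.matches(s));
}
```

The sketch reads "single region degraded" literally, so a multi-region degradation matches no rule here. The source rules do not cover that case, and a real rule table would need an explicit entry for it.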

Architecture flow

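The response flow follows the four stages from the table in Key concepts: detection, classification, action, and learning, with each retrospective feeding back into the runbook.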

Tradeoffs

Standardizing runbooks speeds up response but limits some team-specific flexibility.
Increasing automation reduces human error, but incorrect automation can amplify the damage.
Mandatory retrospectives have a strong learning effect but add to the workload of operations staff.
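
One way to limit the damage from incorrect automation is to gate it. The following is a minimal sketch, assuming a hypothetical `AutomatedAction` type: nothing runs in dry-run mode, destructive actions need explicit approval, and the sequence stops at the first failed post-check. All names are illustrative.

```typescript
// Hypothetical automated action; `destructive` marks steps that change production state.
type AutomatedAction = {
  name: string;
  destructive: boolean;
  run: () => boolean; // returns true when the action's post-check passes
};

type GuardOptions = {
  dryRun: boolean;        // when true, no action is executed
  confirmed: Set<string>; // names of actions a human has explicitly approved
};

type ActionResult = {
  name: string;
  status: "skipped" | "needs-approval" | "ok" | "failed";
};

// Run actions in order; stop at the first unapproved destructive action or failed post-check.
function runGuarded(actions: AutomatedAction[], opts: GuardOptions): ActionResult[] {
  const results: ActionResult[] = [];
  for (const action of actions) {
    if (opts.dryRun) {
      results.push({ name: action.name, status: "skipped" });
      continue;
    }
    if (action.destructive && !opts.confirmed.has(action.name)) {
      results.push({ name: action.name, status: "needs-approval" });
      break;
    }
    const ok = action.run();
    results.push({ name: action.name, status: ok ? "ok" : "failed" });
    if (!ok) break;
  }
  return results;
}
```

The gate trades some response speed for safety, which is the same tradeoff described above, so which actions count as destructive is a decision for each team.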

Summary

The goal of an incident runbook is not a complete document but a shorter recovery time. Locking every step from notification to retrospective into a consistent format and linking it to automation can reduce the cost of recurrence.

Image source

Cover: source link
License: CC BY 2.0 / Author: Alan Levine from United States
Note: The freely licensed image was downloaded from Wikimedia Commons and optimized as a 1600px JPG.