
Incident Response Runbook Design

How to build a consistent response flow from alert intake through communication, recovery, and post-incident analysis


Introduction

The biggest cost in responding to a failure is usually delayed decision-making, not the technical recovery itself. Even when a runbook exists, it is often hard to use during an incident because it is out of date or because roles are not clearly assigned. This article summarizes, from an SRE perspective, how to design an incident runbook that can be executed immediately.

Incident Response Runbook design cover
Free image sourced from Wikimedia Commons

Problem definition

Runbooks usually fail not because they are poorly written but because they cannot be executed when they are needed.

  • Detection criteria and severity classification are ambiguous, so the response starts late.
  • Key commands differ from one environment to another, so the initial action is delayed.
  • Retrospective findings never feed back into the runbook, so the same failure recurs.
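The second point above can be reduced by keeping one logical action per step and resolving the environment-specific command at lookup time. The following is a minimal sketch under stated assumptions: the environment names, the action name, the kubectl context names, and the deployment name are all hypothetical and only illustrate the shape of the table.

```typescript
// Hypothetical environments; replace with the ones your team actually runs.
type Environment = "staging" | "production";

// One logical action mapped to the exact command for each environment.
type CommandTable = Record<string, Record<Environment, string>>;

// Illustrative entries only: context and deployment names are assumptions.
const commands: CommandTable = {
  "rollback-api": {
    staging: "kubectl --context staging-cluster rollout undo deployment/api",
    production: "kubectl --context prod-cluster rollout undo deployment/api",
  },
};

// Returns the command to run, or throws so that a missing entry is caught
// when the runbook is reviewed rather than in the middle of an incident.
function resolveCommand(
  table: CommandTable,
  action: string,
  env: Environment,
): string {
  const perEnv = table[action];
  if (perEnv === undefined) {
    throw new Error(`Unknown runbook action: ${action}`);
  }
  return perEnv[env];
}
```

With a table like this, the runbook step names only the action, and the responder never has to remember which variant of a command applies to which environment.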

A practical runbook should be an index of executable scripts, not a document to be read. The order of steps, the decision criteria, and the communication plan should all fit on one screen.

Key concepts

Perspective    | Design criteria                         | Verification point
Detection      | Alarm threshold + duplicate suppression | Time from detection to notification
Classification | Fixed table by severity                 | Initial misclassification rate
Action         | Link to automation script               | Initial recovery time
Learning       | Runbook update after retrospective      | Reduction in recurrence rate

A good runbook gets you executing faster than a detailed explanation would. The key is to speed up decision-making by compressing the runbook down to commands and checkpoints.
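The duplicate suppression mentioned in the detection row can be sketched as a small pure function. This is one possible approach, not a description of any particular alerting tool: the `Alarm` shape, the `key` field, and the window length are assumptions chosen for illustration.

```typescript
// Hypothetical alarm shape: `key` identifies the alert rule and target.
type Alarm = { key: string; firedAtMs: number };

// Keeps the first alarm per key and drops repeats of that key that fire
// within `windowMs` of the last alarm that was allowed through.
function suppressDuplicates(alarms: Alarm[], windowMs: number): Alarm[] {
  const lastNotified = new Map<string, number>();
  const notified: Alarm[] = [];
  // Sort a copy so the caller's array is left untouched.
  const sorted = [...alarms].sort((a, b) => a.firedAtMs - b.firedAtMs);
  for (const alarm of sorted) {
    const last = lastNotified.get(alarm.key);
    if (last === undefined || alarm.firedAtMs - last >= windowMs) {
      notified.push(alarm);
      lastNotified.set(alarm.key, alarm.firedAtMs);
    }
  }
  return notified;
}

// Example: three firings of the same alarm and one of a different alarm.
const sampleAlarms: Alarm[] = [
  { key: "api-latency", firedAtMs: 0 },
  { key: "api-latency", firedAtMs: 60_000 },
  { key: "checkout-errors", firedAtMs: 10_000 },
  { key: "api-latency", firedAtMs: 400_000 },
];
const toNotify = suppressDuplicates(sampleAlarms, 300_000);
```

With a five-minute window, the repeat at 60 seconds is dropped while the firing at 400 seconds notifies again, so a persistent problem is not silenced indefinitely.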

Code example 1: Runbook step definition

// One executable step in a runbook.
export type RunbookStep = {
  id: string;
  objective: string; // what the step is meant to achieve
  command: string;   // script to run for this step
  doneWhen: string;  // observable condition that ends the step
};

export const apiLatencyRunbook: RunbookStep[] = [
  {
    id: "detect",
    objective: "Confirm the latency spike",
    command: "./scripts/slo/check-latency.sh --window 5m",
    doneWhen: "p95 > 800ms reproduced",
  },
  {
    id: "mitigate",
    objective: "Mitigate traffic impact",
    command: "./scripts/release/rollback-canary.sh",
    doneWhen: "error rate < 1%",
  },
];
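To show how a step list like the one above can be compressed onto one screen, here is a minimal rendering sketch. The `renderChecklist` function and its output layout are assumptions for illustration; the `RunbookStep` type is repeated so the block stands on its own.

```typescript
type RunbookStep = {
  id: string;
  objective: string;
  command: string;
  doneWhen: string;
};

// Renders each step as three short lines: what to do, what to run,
// and how to tell that the step is finished.
function renderChecklist(title: string, steps: RunbookStep[]): string {
  const blocks = steps.map(
    (step, i) =>
      `${i + 1}. [${step.id}] ${step.objective}\n` +
      `   run:  ${step.command}\n` +
      `   done: ${step.doneWhen}`,
  );
  return [title, ...blocks].join("\n");
}

const steps: RunbookStep[] = [
  {
    id: "detect",
    objective: "Confirm the latency spike",
    command: "./scripts/slo/check-latency.sh --window 5m",
    doneWhen: "p95 > 800ms reproduced",
  },
  {
    id: "mitigate",
    objective: "Mitigate traffic impact",
    command: "./scripts/release/rollback-canary.sh",
    doneWhen: "error rate < 1%",
  },
];

const checklist = renderChecklist("API latency runbook", steps);
```

Keeping the rendered form to three lines per step makes it easy to paste into an incident channel, so everyone follows the same order and exit conditions.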

Code example 2: Sev classification rule

severity_rules:
  - sev: SEV1
    condition: "checkout_success_rate < 70% for 5m"
    action: "incident channel + oncall page"
  - sev: SEV2
    condition: "api_p95 > 1200ms for 10m"
    action: "oncall page"
  - sev: SEV3
    condition: "single region degraded"
    action: "slack alert"
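The rules above only remove ambiguity if they can be evaluated mechanically. The sketch below shows one way to do that in TypeScript, assuming conditions follow the pattern `metric < or > threshold for Nm` and that metrics arrive as one sample per minute, oldest first. The parser, the `Samples` shape, and the sample values are assumptions, not part of any monitoring product.

```typescript
type Rule = { sev: string; condition: string };
type Parsed = { metric: string; op: "<" | ">"; threshold: number; minutes: number };
// Hypothetical input: one value per minute for each metric, oldest first.
type Samples = Record<string, number[]>;

// Parses conditions such as "api_p95 > 1200ms for 10m".
// Returns null for free-text conditions that cannot be evaluated.
function parseCondition(condition: string): Parsed | null {
  const m = /^(\w+)\s*([<>])\s*(\d+(?:\.\d+)?)\s*(?:%|ms)?\s+for\s+(\d+)m$/.exec(
    condition.trim(),
  );
  if (m === null) return null;
  return {
    metric: m[1],
    op: m[2] === "<" ? "<" : ">",
    threshold: Number(m[3]),
    minutes: Number(m[4]),
  };
}

// True only if every sample in the most recent `minutes` window breaches.
function holds(p: Parsed, samples: Samples): boolean {
  const series = samples[p.metric] ?? [];
  if (series.length < p.minutes) return false;
  return series
    .slice(-p.minutes)
    .every((v) => (p.op === "<" ? v < p.threshold : v > p.threshold));
}

// Rules are expected in order from most to least severe; first match wins.
function classify(rules: Rule[], samples: Samples): string | null {
  for (const rule of rules) {
    const parsed = parseCondition(rule.condition);
    if (parsed !== null && holds(parsed, samples)) return rule.sev;
  }
  return null;
}

const rules: Rule[] = [
  { sev: "SEV1", condition: "checkout_success_rate < 70% for 5m" },
  { sev: "SEV2", condition: "api_p95 > 1200ms for 10m" },
  { sev: "SEV3", condition: "single region degraded" },
];
```

Note that `single region degraded` does not parse, so this sketch can never return SEV3. That kind of condition still depends on human judgment unless it is rewritten as a measurable threshold.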

Architecture flow

[Diagram: response flow from alert intake through communication and recovery to post-incident analysis]

Tradeoffs

  • Standardizing runbooks speeds up response, but limits some team-specific flexibility.
  • More automation reduces human error, but incorrect automation can amplify the damage.
  • Mandatory retrospectives have a strong learning effect, but add to the operations team's workload.
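One common way to limit the risk in the second tradeoff is to require a named approver before any state-changing step runs. The following is a sketch of that idea only; the `destructive` flag, the `gateStep` function, and the approver handle are hypothetical.

```typescript
type AutomatedStep = {
  id: string;
  command: string;
  destructive: boolean; // true if the step changes production state
};

type GateDecision =
  | { allowed: true }
  | { allowed: false; reason: string };

// Read-only steps run immediately; state-changing steps need an approver.
function gateStep(step: AutomatedStep, approvedBy: string | null): GateDecision {
  if (!step.destructive) return { allowed: true };
  if (approvedBy !== null && approvedBy.trim() !== "") return { allowed: true };
  return {
    allowed: false,
    reason: `Step "${step.id}" changes state and needs a named approver`,
  };
}

const checkLatency: AutomatedStep = {
  id: "detect",
  command: "./scripts/slo/check-latency.sh --window 5m",
  destructive: false,
};
const rollback: AutomatedStep = {
  id: "mitigate",
  command: "./scripts/release/rollback-canary.sh",
  destructive: true,
};

const blocked = gateStep(rollback, null);
const approved = gateStep(rollback, "oncall-lead");
```

The gate costs a few seconds of response time, which is the tradeoff in miniature: slightly slower mitigation in exchange for a second pair of eyes on the commands that can make an incident worse.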

Summary

The goal of an incident runbook is not a complete document but a shorter recovery time. Locking every step, from notification to retrospective, into a consistent format and linking it to automation can reduce the cost of recurring failures.
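To keep retrospectives connected to runbook updates, a simple check can flag runbooks that have not been touched since their most recent retrospective. This is a sketch under assumptions: the record shapes, field names, identifiers, and dates are hypothetical, and dates are ISO 8601 strings.

```typescript
// Hypothetical metadata; dates are ISO 8601 strings such as "2024-05-01".
type RunbookMeta = { id: string; lastUpdated: string };
type Retrospective = { incidentId: string; runbookId: string; completedOn: string };

// Returns the ids of runbooks with a retrospective newer than their last update.
function staleRunbooks(runbooks: RunbookMeta[], retros: Retrospective[]): string[] {
  return runbooks
    .filter((rb) =>
      retros.some(
        (r) =>
          r.runbookId === rb.id &&
          Date.parse(r.completedOn) > Date.parse(rb.lastUpdated),
      ),
    )
    .map((rb) => rb.id);
}

const runbookIndex: RunbookMeta[] = [
  { id: "api-latency", lastUpdated: "2024-03-01" },
  { id: "checkout-errors", lastUpdated: "2024-06-10" },
];
const retrospectives: Retrospective[] = [
  { incidentId: "INC-1", runbookId: "api-latency", completedOn: "2024-05-20" },
  { incidentId: "INC-2", runbookId: "checkout-errors", completedOn: "2024-06-01" },
];

const needsUpdate = staleRunbooks(runbookIndex, retrospectives);
```

Running a check like this on a schedule turns "update the runbook after the retrospective" from an intention into something that shows up as a visible list.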

Image source

  • Cover: source link
  • License: CC BY 2.0 / Author: Alan Levine from United States
  • Note: After downloading the free license image from Wikimedia Commons, it was optimized to JPG at 1600px.
