
Part 6. Analysis of actual failure cases (SRE perspective)

This article analyzes real failure patterns — in the order of symptoms, logs, causes, and solutions — for three cases: server removal without drain, WebSocket drain stalls, and keepalive misconfiguration.

Series: The Complete Guide to Graceful Drain

This series has 7 parts. You are viewing Part 6.

What matters in operations is not “what technology was used” but “how failures are reproduced and reduced.” This article covers three failures that frequently occur in the L4 -> Nginx -> App -> WebSocket stack.

Versions covered

  • Linux Kernel 5.15+
  • Nginx 1.25+
  • JVM 21

1. Case 1: Removing the server without draining

Key takeaways

  • If you remove a server without unbind/connection draining, a burst of TCP resets follows immediately.

Detailed description

Symptoms:

  • API error rate rapidly increases (seconds to minutes)
  • Mobile client retries skyrocket
  • LB reset counter rapidly increases

Log:

[lb] backend vm-c removed from pool abruptly
[lb] reset_out=420/s active_conn dropped 1800 -> 300 in 10s
[nginx] recv() failed (104: Connection reset by peer)

Cause:

  • Existing ESTABLISHED sessions are torn down without a FIN (hard cut).
  • The app’s graceful shutdown hook never runs.

Solution:

  1. Make unbind + drain mandatory before deployment
  2. Define the drain completion condition as new=0 && active<threshold
  3. Call the app’s graceful shutdown API before terminate
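
The completion condition in step 2 can be expressed as a small poll loop in the deployment tooling. A minimal sketch, assuming a hypothetical `get_stats()` callable that returns the LB’s per-backend counters (the field names `new_conn`/`active_conn` mirror the log above but are illustrative):

```python
import time

def drain_complete(stats: dict, active_threshold: int = 50) -> bool:
    # Drain is done only when no new connections arrive AND the
    # remaining active connections have fallen below the threshold.
    return stats["new_conn"] == 0 and stats["active_conn"] < active_threshold

def wait_for_drain(get_stats, timeout_s: float = 1800, poll_s: float = 5) -> bool:
    # Poll backend stats until the drain condition holds or the timeout expires.
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if drain_complete(get_stats()):
            return True
        time.sleep(poll_s)
    return False
```

Only when `wait_for_drain` returns True should the pipeline proceed to the graceful shutdown API call and terminate.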

Practical tips

  • Place a pipeline gate so that the drain stage cannot be omitted in operational automation.
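
One way to implement that gate is to validate the stage list before the pipeline runs, failing fast if the drain stage is missing or misplaced. A minimal sketch; the stage names are hypothetical:

```python
REQUIRED_ORDER = ["unbind", "drain", "graceful_shutdown", "terminate"]

def validate_stages(stages: list) -> None:
    # Reject any pipeline that omits a mandatory stage or runs them out of order.
    try:
        positions = [stages.index(s) for s in REQUIRED_ORDER]
    except ValueError as e:
        raise ValueError("missing mandatory stage") from e
    if positions != sorted(positions):
        raise ValueError("mandatory stages out of order")
```

Calling this at pipeline-compile time (not at run time) makes the hard cut unrepresentable rather than merely discouraged.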

Common Mistakes

  • Rolling back hastily after a failure and repeating the same hard cut.

2. Case 2: Unable to remove server due to WebSocket drain failure

Key takeaways

  • New connections are blocked, but existing WebSocket sessions linger and the drain never completes.

Detailed description

Symptoms:

  • new connections = 0
  • active connections stays nearly flat for more than 30 minutes
  • The deployment slot stays occupied, delaying the next deployment

Log:

[lb] backend=vm-a draining new_conn=0 active_conn=1197
[app] ws_sessions=1184, avg_age=2h 11m
[deploy] timeout waiting for drain completion (1800s)

Cause:

  • Long-lived WebSocket sessions
  • No server-driven shutdown policy
  • Absence of client reconnect distribution strategy

Solution:

  1. Send a DRAIN_NOTICE to clients when the drain starts
  2. Close sessions incrementally once their TTL is exceeded
  3. Apply backoff + jitter to client reconnects
  4. Run an approval-gated forced-close stage when the maximum deployment wait time is exceeded
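
The reconnect side of step 3 is typically full-jitter exponential backoff, so that the thousand-plus clients dropped during a drain do not all reconnect in the same instant. A minimal sketch; the base and cap values are illustrative:

```python
import random

def reconnect_delay(attempt: int, base_s: float = 1.0, cap_s: float = 60.0) -> float:
    # Full jitter: pick uniformly in [0, min(cap, base * 2^attempt)].
    # This spreads the reconnect storm across the whole window instead
    # of letting every client retry at the same moment.
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(0.0, ceiling)
```

The client sleeps for `reconnect_delay(attempt)` before each retry, resetting `attempt` to 0 after a successful connection.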

Practical tips

  • Set the deployment window based on WebSocket session lifetime P99.

Common Mistakes

  • Believing that simply increasing the timeout will solve the problem.

3. Case 3: TIME_WAIT explosion due to keepalive setting issue

Key takeaways

  • If the keepalive strategy is wrong, TIME_WAIT sockets explode and port/CPU pressure rises.

Detailed description

Symptoms:

  • TIME_WAIT socket count surges
  • Intermittent spikes in connect latency
  • Growing variance between nodes during high-traffic periods

Logs/Metrics:

ss -s
TCP: inuse 1840 orphan 0 tw 68321 alloc 2010 mem 321

netstat -an | grep TIME_WAIT | wc -l
68321

Cause:

  • keepalive_timeout is too short, increasing connection churn
  • Insufficient upstream keepalive pool size
  • Burst traffic creates a large number of short-lived connections

Solution:

  1. Adjust keepalive_timeout to the traffic pattern
  2. Increase the upstream keepalive connection pool
  3. In the short term, tune per-node connection limits and backlog
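
Whether the tuning worked is easier to judge from the TIME_WAIT trend than from a single snapshot. A minimal sketch that parses periodic `ss -s` output and computes the growth rate; the regex assumes the summary format shown in the log above:

```python
import re

def parse_time_wait(ss_summary: str) -> int:
    # Extract the TIME_WAIT ("tw") count from `ss -s` output.
    m = re.search(r"\btw (\d+)", ss_summary)
    if m is None:
        raise ValueError("no 'tw' field found in ss -s output")
    return int(m.group(1))

def tw_growth_rate(samples: list) -> float:
    # Sockets per second between the first and last (t_seconds, count)
    # sample; the growth rate matters more than the absolute count.
    (t0, c0), (t1, c1) = samples[0], samples[-1]
    return (c1 - c0) / (t1 - t0)
```

Feeding this a sample every few seconds and alerting on a sustained positive rate catches churn problems long before port exhaustion.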

Practical tips

  • Watch the correlation between the TIME_WAIT growth rate and the error rate, not the absolute count.

Common Mistakes

  • Attempting aggressive kernel tuning to force TIME_WAIT down to near zero.

4. Diagram: Failure Reproduction Timeline

Key takeaways

  • Replaying the incident along a timeline speeds up isolating the cause.

Detailed description

Timeline (example)
14:00 deploy start
14:02 vm remove without drain -> reset spike
14:10 rollback
15:20 websocket drain starts -> active stuck
16:00 drain timeout
18:10 keepalive tweak mistake -> TIME_WAIT surge

Practical tips

  • Always record both the time each setting changed and the time each metric changed in the failure review.

Common Mistakes

  • Keeping only log messages, without organizing the timeline and state transitions.

Operational Checklist

  • Is the no hard cut policy enforced in the automation pipeline?
  • Is there an alternative plan (notification/force/retry) in case of WebSocket drain failure?
  • Is load testing performed before keepalive changes?
  • Are symptoms/causes/solutions/recurrence prevention recorded in the incident review?

Summary

What the three cases have in common is simple: failures arise from missing state transitions, not from missing features. The shutdown rules of the L4 load balancer, Nginx, and the app must be bound into a single operating agreement.

Next episode preview

The next part provides a verification checklist and metric baselines that operators can run through in 10 minutes right before a deployment.
