Part 6. Analysis of actual failure cases (SRE perspective)
Actual failure patterns caused by server removal without drain, WebSocket drain failure, and keepalive setting imbalance are analyzed through symptoms, logs, causes, and solutions.
Series: The Complete Guide to Graceful Drain
A 7-part series. You are currently reading Part 6.
- Part 1. Why does the service crash if you simply remove the server?
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective) (current)
- Part 7. Operation checklist and verification method
What matters in operation is not "what technology was used" but "how failures are reproduced and reduced." This article covers three failures that frequently occur in an L4 -> Nginx -> App -> WebSocket stack.
Versions assumed
- Linux Kernel 5.15+
- Nginx 1.25+
- JVM 21
1. Case 1: Removing the server without draining
Key takeaways
- If you remove a server without unbind/Connection Draining, connection resets explode immediately.
Detailed description
Symptoms:
- API error rate rapidly increases (seconds to minutes)
- Mobile client retries skyrocket
- LB reset counter rapidly increases
Log:
[lb] backend vm-c removed from pool abruptly
[lb] reset_out=420/s active_conn dropped 1800 -> 300 in 10s
[nginx] recv() failed (104: Connection reset by peer)
Cause:
- Existing ESTABLISHED sessions are torn down without FIN (hard cut)
- The app's graceful shutdown hook never runs
Solution:
- Make unbind + drain mandatory before deployment
- Define the drain completion condition as new=0 && active<threshold
- Call the app's graceful shutdown API before terminating the process
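The drain completion condition above can be sketched as a small polling gate. This is an illustrative sketch, not the article's actual tooling: `get_stats` is a hypothetical stand-in for whatever LB stats API is available, and the threshold/timeout values are examples.

```python
import time

# Hypothetical drain gate: poll LB backend stats until the drain
# completion condition (new == 0 and active < threshold) holds,
# or the deadline passes. get_stats is injected so the polling
# logic stays testable; in practice it would query the LB stats API.
def wait_for_drain(get_stats, active_threshold=50, timeout_s=1800, poll_s=1):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        stats = get_stats()
        if stats["new"] == 0 and stats["active"] < active_threshold:
            return True  # safe to call the app's graceful API, then terminate
        time.sleep(poll_s)
    return False  # do NOT fall back to a hard cut; escalate for approval

# Simulated backend whose active connections decay once new conns stop.
samples = iter([
    {"new": 3, "active": 1800},
    {"new": 0, "active": 900},
    {"new": 0, "active": 30},
])
print(wait_for_drain(lambda: next(samples), active_threshold=50, poll_s=0))  # prints True
```

Returning False instead of forcing termination is deliberate: it is what makes a pipeline gate (see the tip below) enforceable.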
Practical tips
- Place a pipeline gate so that the drain stage cannot be skipped in deployment automation.
Common Mistakes
- Rolling back hastily after the failure and repeating the same hard cut.
2. Case 2: Unable to remove server due to WebSocket drain failure
Key takeaways
- New connections are blocked, but existing WebSocket sessions persist and the drain never completes.
Detailed description
Symptoms:
- New connections = 0, but active connections stay nearly flat for 30+ minutes
- Deployment slots stay occupied, delaying the next deployment
Log:
[lb] backend=vm-a draining new_conn=0 active_conn=1197
[app] ws_sessions=1184, avg_age=2h 11m
[deploy] timeout waiting for drain completion (1800s)
Cause:
- Long-lived WebSocket sessions
- No server-driven shutdown policy
- No strategy for spreading client reconnects over time
Solution:
- Send a DRAIN_NOTICE message when the drain starts
- Incrementally close sessions that exceed their TTL
- Apply backoff + jitter to client reconnects
- Escalate to an approval-based forced-close stage when the maximum deployment wait time is exceeded
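The backoff + jitter item can be illustrated with the well-known "full jitter" scheme; the function name and constants below are illustrative, not from the series. Spreading reconnect times prevents all drained clients from stampeding the remaining backends at the same instant.

```python
import random

# "Full jitter" exponential backoff for WebSocket reconnects:
# exponential growth capped at `cap`, then uniformly jittered over [0, exp].
def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    exp = min(cap, base * (2 ** attempt))
    return rng() * exp

# Example: delays for the first few reconnect attempts (non-deterministic).
for attempt in range(5):
    print(f"attempt {attempt}: sleep {reconnect_delay(attempt):.2f}s")
```

Injecting `rng` keeps the function deterministic under test; clients would simply use the default `random.random`.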
Practical tips
- Set the deployment window based on the P99 WebSocket session lifetime.
Common Mistakes
- Believing that simply increasing the timeout will solve the problem.
3. Case 3: TIME_WAIT explosion due to keepalive setting issue
Key takeaways
- If the keepalive strategy is wrong, TIME_WAIT sockets explode and port exhaustion/CPU load increase.
Detailed description
Symptoms:
- TIME_WAIT sockets surge
- Transient increases in connect latency
- Growing variance between nodes during high-traffic periods
Logs/Metrics:
ss -s
TCP: inuse 1840 orphan 0 tw 68321 alloc 2010 mem 321
netstat -an | grep TIME_WAIT | wc -l
68321
Cause:
- keepalive_timeout is too short, increasing connection churn
- Insufficient upstream keepalive pool size
- A large number of short-lived connections are created in burst traffic.
Solution:
- Adjust keepalive_timeout to match traffic characteristics
- Increase the upstream keepalive connection pool
- In the short term, tune per-node connection limits and backlog
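The first two items map to a handful of Nginx directives. The fragment below is a minimal sketch with illustrative values and addresses (not the article's actual config); the numbers must be tuned against measured traffic, not copied.

```nginx
# Illustrative values only -- tune against measured traffic.
upstream app_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    keepalive 128;              # idle upstream keepalive connections per worker
}

server {
    listen 80;
    keepalive_timeout 65s;      # long enough to avoid churn under bursts

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;            # required for upstream keepalive
        proxy_set_header Connection "";    # clear "close" so connections are reused
    }
}
```

Without the last two `proxy_*` lines, the `keepalive` pool is silently unused and every burst creates fresh short-lived upstream connections, which is exactly the churn pattern described above.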
Practical tips
- Watch the correlation between the TIME_WAIT growth rate and the error rate, rather than the absolute TIME_WAIT count.
Common Mistakes
- Attempting aggressive kernel tuning to drive TIME_WAIT to zero.
4. Diagram: Failure Reproduction Timeline
Key takeaways
- Replaying events along a timeline speeds up isolating the cause.
Detailed description
Timeline (example)
14:00 deploy start
14:02 vm remove without drain -> reset spike
14:10 rollback
15:20 websocket drain starts -> active stuck
16:00 drain timeout
18:10 keepalive tweak mistake -> TIME_WAIT surge
Practical tips
- In the incident review document, always record both when settings were changed and when metrics changed.
Common Mistakes
- Keeping only raw log messages, without organizing timelines and state transitions.
Operational Checklist
- Is the "no hard cut" policy enforced in the automation pipeline?
- Is there a fallback plan (notify / force-close / retry) for WebSocket drain failure?
- Is load testing performed before keepalive changes?
- Are symptoms/causes/solutions/recurrence prevention recorded in the incident review?
Summary
What the three cases have in common is simple: failures arise from missing state transitions, not from missing features. The termination rules of the L4 load balancer, Nginx, and the app must be bundled into a single operating agreement.
Next episode preview
The next part provides a verification checklist and metric criteria that operators can run through within 10 minutes right before deployment.
Series navigation
- Previous post: Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Next post: Part 7. Operation checklist and verification method