Part 6. Analysis of actual failure cases (SRE perspective)
Actual failure patterns caused by server removal without drain, WebSocket drain failure, and keepalive setting imbalance are analyzed through symptoms, logs, causes, and solutions.
Series: The Complete Guide to Graceful Drain
A 7-part series. You are currently reading Part 6.
- Part 1. Why does the service crash if you simply remove the server?
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective) (current)
- Part 7. Operation checklist and verification method
What matters in operation is not "what technology was used" but "how failures are reproduced and reduced." This article covers three failures that frequently occur in an L4 -> Nginx -> App -> WebSocket stack.
Versions assumed
- Linux Kernel 5.15+
- Nginx 1.25+
- JVM 21
1. Case 1: Removing the server without draining
Key takeaways
- If you remove a server without unbind/Connection Draining, connection resets explode immediately.
Detailed description
Symptoms:
- API error rate rapidly increases (seconds to minutes)
- Mobile client retries skyrocket
- LB reset counter rapidly increases
Log:
[lb] backend vm-c removed from pool abruptly
[lb] reset_out=420/s active_conn dropped 1800 -> 300 in 10s
[nginx] recv() failed (104: Connection reset by peer)
Cause:
- Existing ESTABLISHED sessions are torn down without FIN (hard cut)
- The app's graceful shutdown hook never runs
Solution:
- Make unbind + drain mandatory before deployment
- Define the drain completion condition as new=0 && active<threshold
- Call the app's graceful shutdown API before terminating the process
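The drain completion condition above can be sketched as a small polling gate. This is an illustrative sketch, not the article's actual tooling: `get_stats` is a hypothetical stand-in for whatever LB stats API is available, and the threshold/timeout values are examples.

```python
import time

# Hypothetical drain gate: poll LB backend stats until the drain
# completion condition (new == 0 and active < threshold) holds,
# or the deadline passes. get_stats is injected so the polling
# logic stays testable; in practice it would query the LB stats API.
def wait_for_drain(get_stats, active_threshold=50, timeout_s=1800, poll_s=1):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        stats = get_stats()
        if stats["new"] == 0 and stats["active"] < active_threshold:
            return True  # safe to call the app's graceful API, then terminate
        time.sleep(poll_s)
    return False  # do NOT fall back to a hard cut; escalate for approval

# Simulated backend whose active connections decay once new conns stop.
samples = iter([
    {"new": 3, "active": 1800},
    {"new": 0, "active": 900},
    {"new": 0, "active": 30},
])
print(wait_for_drain(lambda: next(samples), active_threshold=50, poll_s=0))  # prints True
```

Returning False instead of forcing termination is deliberate: it is what makes a pipeline gate (see the tip below) enforceable.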
Practical tips
- Place a pipeline gate so that the drain stage cannot be skipped in deployment automation.
Common Mistakes
- Rolling back hastily after the failure and repeating the same hard cut.
2. Case 2: Unable to remove server due to WebSocket drain failure
Key takeaways
- New connections are blocked, but existing WebSocket sessions persist and the drain never completes.
Detailed description
Symptoms:
- New connections = 0, but active connections stay nearly flat for 30+ minutes
- Deployment slots stay occupied, delaying the next deployment
Log:
[lb] backend=vm-a draining new_conn=0 active_conn=1197
[app] ws_sessions=1184, avg_age=2h 11m
[deploy] timeout waiting for drain completion (1800s)
Cause:
- Long-lived WebSocket sessions
- No server-driven shutdown policy
- No strategy for spreading client reconnects over time
Solution:
- Send a DRAIN_NOTICE message when the drain starts
- Incrementally close sessions that exceed their TTL
- Apply backoff + jitter to client reconnects
- Escalate to an approval-based forced-close stage when the maximum deployment wait time is exceeded
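The backoff + jitter item can be illustrated with the well-known "full jitter" scheme; the function name and constants below are illustrative, not from the series. Spreading reconnect times prevents all drained clients from stampeding the remaining backends at the same instant.

```python
import random

# "Full jitter" exponential backoff for WebSocket reconnects:
# exponential growth capped at `cap`, then uniformly jittered over [0, exp].
def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    exp = min(cap, base * (2 ** attempt))
    return rng() * exp

# Example: delays for the first few reconnect attempts (non-deterministic).
for attempt in range(5):
    print(f"attempt {attempt}: sleep {reconnect_delay(attempt):.2f}s")
```

Injecting `rng` keeps the function deterministic under test; clients would simply use the default `random.random`.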
Practical tips
- Set the deployment window based on the P99 WebSocket session lifetime.
Common Mistakes
- Believing that simply increasing the timeout will solve the problem.
3. Case 3: TIME_WAIT explosion due to keepalive setting issue
Key takeaways
- If the keepalive strategy is wrong, TIME_WAIT sockets explode and port exhaustion/CPU load increase.
Detailed description
Symptoms:
- TIME_WAIT sockets surge
- Transient increases in connect latency
- Growing variance between nodes during high-traffic periods
Logs/Metrics:
ss -s
TCP: inuse 1840 orphan 0 tw 68321 alloc 2010 mem 321
netstat -an | grep TIME_WAIT | wc -l
68321
Cause:
- keepalive_timeout is too short, increasing connection churn
- Insufficient upstream keepalive pool size
- A large number of short-lived connections are created in burst traffic.
Solution:
- Adjust keepalive_timeout to match traffic characteristics
- Increase the upstream keepalive connection pool
- In the short term, tune per-node connection limits and backlog
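The first two items map to a handful of Nginx directives. The fragment below is a minimal sketch with illustrative values and addresses (not the article's actual config); the numbers must be tuned against measured traffic, not copied.

```nginx
# Illustrative values only -- tune against measured traffic.
upstream app_backend {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
    keepalive 128;              # idle upstream keepalive connections per worker
}

server {
    listen 80;
    keepalive_timeout 65s;      # long enough to avoid churn under bursts

    location / {
        proxy_pass http://app_backend;
        proxy_http_version 1.1;            # required for upstream keepalive
        proxy_set_header Connection "";    # clear "close" so connections are reused
    }
}
```

Without the last two `proxy_*` lines, the `keepalive` pool is silently unused and every burst creates fresh short-lived upstream connections, which is exactly the churn pattern described above.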
Practical tips
- Watch the correlation between the TIME_WAIT growth rate and the error rate, rather than the absolute TIME_WAIT count.
Common Mistakes
- Attempting aggressive kernel tuning to drive TIME_WAIT to zero.
4. Diagram: Failure Reproduction Timeline
Key takeaways
- Replaying events along a timeline speeds up isolating the cause.
Detailed description
Timeline (example)
14:00 deploy start
14:02 vm remove without drain -> reset spike
14:10 rollback
15:20 websocket drain starts -> active stuck
16:00 drain timeout
18:10 keepalive tweak mistake -> TIME_WAIT surge
Practical tips
- In the incident review document, always record both when settings were changed and when metrics changed.
Common Mistakes
- Keeping only raw log messages, without organizing timelines and state transitions.
Operational Checklist
- Is the "no hard cut" policy enforced in the automation pipeline?
- Is there a fallback plan (notify / force-close / retry) for WebSocket drain failure?
- Is load testing performed before keepalive changes?
- Are symptoms/causes/solutions/recurrence prevention recorded in the incident review?
Summary
What the three cases have in common is simple: failures arise from missing state transitions, not from missing features. The termination rules of the L4 load balancer, Nginx, and the app must be bundled into a single operating agreement.
Next episode preview
The next part provides a verification checklist and metric criteria that operators can run through within 10 minutes right before deployment.
Series navigation
- Previous post: Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Next post: Part 7. Operation checklist and verification method