Part 1. Why does service crash if you simply remove the server?
We analyze the reasons why hard cuts lead to failures in L4 Load Balancer and Nginx-WebSocket architecture from the connection lifecycle perspective.
Series: The Complete Guide to Graceful Drain
A 7-part series. You are currently reading Part 1.
- Part 1. Why does service crash if you simply remove the server? (current)
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
In an architecture of L4 Load Balancer -> VM -> Nginx -> Application -> WebSocket/HTTP API, the first zero-downtime failure almost always begins with simply pulling a server out. This article explains, from an operational perspective, why hard cuts break Connection Draining, Graceful Shutdown, and Zero Downtime Deployment.
1. Failure pattern created by Hard Cut
Key takeaways
- If you remove a VM without an unbind, live TCP connections suddenly vanish from the L4 load balancer's perspective.
- When a connection is cut mid-stream instead of ending with a FIN, the client sees an RST or a timeout.
- In a WebSocket load-balancing environment the damage is larger, because connections are long-lived.
Detailed description
An L4 load balancer only forwards connections at the packet level; it does not track state per application request. So if you remove the server immediately, the following happen at once:
- Already-established TCP sessions no longer receive server-side ACKs.
- Some connections are torn down by the kernel emitting an RST, or by timeouts in intermediate network devices.
- HTTP keepalive connections fail on their next request.
- WebSocket clients switch to a reconnection flood.
In other words, the failure is observed not at the "moment of removal" but concentrated in the seconds to minutes immediately after it.
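The RST behavior described above can be reproduced locally. The sketch below (a toy demonstration, not production code) makes a server abort a connection with `SO_LINGER` set to zero, which tells the kernel to send an RST instead of a graceful FIN, so the client observes the same reset a hard-cut VM's peers would:

```python
import socket
import struct
import threading

# Minimal local sketch of a "hard cut": closing with SO_LINGER (on=1,
# linger=0) makes the kernel send an RST instead of a graceful FIN --
# the same failure a client sees when a VM is yanked mid-connection.
def hard_cut_server(ready, port_holder):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))            # ephemeral port
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    # onoff=1, linger=0 -> close() aborts the connection with RST.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()

ready, port_holder = threading.Event(), []
t = threading.Thread(target=hard_cut_server, args=(ready, port_holder))
t.start()
ready.wait()

cli = socket.create_connection(("127.0.0.1", port_holder[0]))
try:
    chunk = cli.recv(1024)                 # a graceful FIN returns b"" here
    result = "FIN" if chunk == b"" else "data"
except ConnectionResetError:
    result = "RST"                         # hard cut: client sees a reset
finally:
    cli.close()
    t.join()
print(result)
```

With a normal `close()` (no linger tweak), the same `recv` would return `b""`; the linger-zero abort is what turns a clean shutdown into a client-visible reset.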
Practical tips
- Before removing a server, always switch it to the unbind + drain state on the LB.
- First confirm on the operations dashboard that new connections drop to 0.
- Automate the removal approval condition as "active connections below threshold".
Common Mistakes
- Wiring the Auto Scaling termination event to terminate the instance immediately.
- Judging that "the service is healthy" based only on the VM health check.
- Watching only RPS and ignoring connection metrics.
2. Actual flow from request/connection perspective
Key takeaways
- Request-level success rate and connection-level stability are different metrics.
- keepalive and WebSocket are the key variables that lengthen drain time.
Detailed description
Even if HTTP API requests are short-lived, the underlying connections are reused via keepalive. WebSockets are more extreme: a connection can stay open for hours, so draining barely progresses at all.
Example:
- active connection when drain starts = 1200
- active connection after 30 minutes = 1197
The numbers barely decrease not because of an "inflow of new connections" but because of "existing long-lived connections." Connection Draining therefore has to be a policy about connection lifetime, not just request latency.
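The numbers above make the point with simple arithmetic: at the observed decay rate, a wait-only drain would take days, not minutes:

```python
# Back-of-the-envelope calculation using the numbers above:
# 1200 -> 1197 active connections in 30 minutes.
start, after, minutes = 1200, 1197, 30
rate_per_min = (start - after) / minutes   # 0.1 connections drained per minute
eta_minutes = after / rate_per_min         # minutes to drain the remainder
eta_days = eta_minutes / (60 * 24)
print(round(eta_days, 1))                  # ~8.3 days at this rate
```

No realistic deployment window fits that, which is why long-lived connections need an active close policy (ping/pong deadlines, server-initiated close, client reconnect) rather than passive waiting.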
Practical tips
- If API and WS traffic are mixed on the same port, at least split the metrics per traffic type.
- Design Nginx proxy_read_timeout, the application's ping/pong cycle, and the reconnect policy as a set.
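As a concrete illustration of the second tip, here is a minimal Nginx location sketch. The upstream name `app_ws` and the 75s timeout are assumptions; the only rule it encodes is that the proxy timeout must exceed the application's ping interval:

```nginx
# Hypothetical values -- tune to your own ping/pong cycle.
location /ws/ {
    proxy_pass http://app_ws;                 # assumed upstream name
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;   # WebSocket upgrade headers
    proxy_set_header Connection "upgrade";
    # Must be longer than the app's ping interval (e.g. ~30s pings),
    # or idle WS connections get cut by the proxy between pings.
    proxy_read_timeout 75s;
    proxy_send_timeout 75s;
}
```

If the timeout is shorter than the ping cycle, Nginx itself becomes an unintended hard-cut layer for idle WebSocket connections.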
Common Mistakes
- Setting the drain timeout to a fixed 60 seconds and applying it to every service.
- Deciding the deployment window without WebSocket session lifetime data.
3. Danger points seen through architecture diagrams
Key takeaways
- The dangerous points are the boundaries: L4, Linux TCP state, Nginx upstream, and app shutdown.
- If any single layer does a hard stop, zero downtime fails.
Detailed description
+---------+ +-------------------+ +-----------+ +------------------+
| Client | ---> | L4 Load Balancer | ---> | VM/Nginx | ---> | App + WS/API |
+---------+ +-------------------+ +-----------+ +------------------+
| unbind? | close policy?
+------------------------------+
RST/timeout increase on hard cut
Practical tips
- Document the procedure by drawing the unbind point and the graceful shutdown point as separate timelines on the diagram.
Common Mistakes
- Estimating the cause from app logs alone, without a picture of the network layers.
Series navigation
- Previous post: None (this is the first post in the series)
- Next post: Part 2. L4 Load Balancer bind/unbind and connection lifecycle