Part 1. Why does service crash if you simply remove the server?
We analyze the reasons why hard cuts lead to failures in L4 Load Balancer and Nginx-WebSocket architecture from the connection lifecycle perspective.
Series: The Complete Guide to Graceful Drain
A 7-part series. You are currently reading Part 1.
- Part 1. Why does service crash if you simply remove the server? (current)
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
In an architecture of L4 Load Balancer -> VM -> Nginx -> Application -> WebSocket/HTTP API, the first zero-downtime failure almost always begins with simply pulling a server out. This article explains, from an operational perspective, why hard cuts break Connection Draining, Graceful Shutdown, and Zero Downtime Deployment.
1. Failure pattern created by Hard Cut
Key takeaways
- If you remove a VM without an unbind, live TCP connections suddenly vanish from the L4 load balancer's perspective.
- When a connection is cut mid-stream instead of ending with a FIN, the client sees an RST or a timeout.
- In a WebSocket load-balancing environment the damage is larger, because connections are long-lived.
Detailed description
An L4 load balancer only forwards connections at the packet level; it does not track state per application request. So if you remove the server immediately, the following happen at once:
- Already-established TCP sessions no longer receive server-side ACKs.
- Some connections are torn down by the kernel emitting an RST, or by timeouts in intermediate network devices.
- HTTP keepalive connections fail on their next request.
- WebSocket clients switch to a reconnection flood.
In other words, the failure is observed not at the "moment of removal" but concentrated in the seconds to minutes immediately after it.
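The RST behavior described above can be reproduced locally. The sketch below (a toy demonstration, not production code) makes a server abort a connection with `SO_LINGER` set to zero, which tells the kernel to send an RST instead of a graceful FIN, so the client observes the same reset a hard-cut VM's peers would:

```python
import socket
import struct
import threading

# Minimal local sketch of a "hard cut": closing with SO_LINGER (on=1,
# linger=0) makes the kernel send an RST instead of a graceful FIN --
# the same failure a client sees when a VM is yanked mid-connection.
def hard_cut_server(ready, port_holder):
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))            # ephemeral port
    srv.listen(1)
    port_holder.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    # onoff=1, linger=0 -> close() aborts the connection with RST.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack("ii", 1, 0))
    conn.close()
    srv.close()

ready, port_holder = threading.Event(), []
t = threading.Thread(target=hard_cut_server, args=(ready, port_holder))
t.start()
ready.wait()

cli = socket.create_connection(("127.0.0.1", port_holder[0]))
try:
    chunk = cli.recv(1024)                 # a graceful FIN returns b"" here
    result = "FIN" if chunk == b"" else "data"
except ConnectionResetError:
    result = "RST"                         # hard cut: client sees a reset
finally:
    cli.close()
    t.join()
print(result)
```

With a normal `close()` (no linger tweak), the same `recv` would return `b""`; the linger-zero abort is what turns a clean shutdown into a client-visible reset.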
Practical tips
- Before removing a server, always switch it to the unbind + drain state on the LB.
- First confirm on the operations dashboard that new connections drop to 0.
- Automate the removal approval condition as "active connections below threshold".
Common Mistakes
- Wiring the Auto Scaling termination event to terminate the instance immediately.
- Judging that "the service is healthy" based only on the VM health check.
- Watching only RPS and ignoring connection metrics.
2. Actual flow from request/connection perspective
Key takeaways
- Request-level success rate and connection-level stability are different metrics.
- keepalive and WebSocket are the key variables that lengthen drain time.
Detailed description
Even if HTTP API requests are short-lived, the underlying connections are reused via keepalive. WebSockets are more extreme: a connection can stay open for hours, so draining barely progresses at all.
Example:
- active connection when drain starts = 1200
- active connection after 30 minutes = 1197
The numbers barely decrease not because of an "inflow of new connections" but because of "existing long-lived connections." Connection Draining therefore has to be a policy about connection lifetime, not just request latency.
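The numbers above make the point with simple arithmetic: at the observed decay rate, a wait-only drain would take days, not minutes:

```python
# Back-of-the-envelope calculation using the numbers above:
# 1200 -> 1197 active connections in 30 minutes.
start, after, minutes = 1200, 1197, 30
rate_per_min = (start - after) / minutes   # 0.1 connections drained per minute
eta_minutes = after / rate_per_min         # minutes to drain the remainder
eta_days = eta_minutes / (60 * 24)
print(round(eta_days, 1))                  # ~8.3 days at this rate
```

No realistic deployment window fits that, which is why long-lived connections need an active close policy (ping/pong deadlines, server-initiated close, client reconnect) rather than passive waiting.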
Practical tips
- If API and WS traffic are mixed on the same port, at least split the metrics per traffic type.
- Design Nginx proxy_read_timeout, the application's ping/pong cycle, and the reconnect policy as a set.
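As a concrete illustration of the second tip, here is a minimal Nginx location sketch. The upstream name `app_ws` and the 75s timeout are assumptions; the only rule it encodes is that the proxy timeout must exceed the application's ping interval:

```nginx
# Hypothetical values -- tune to your own ping/pong cycle.
location /ws/ {
    proxy_pass http://app_ws;                 # assumed upstream name
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;   # WebSocket upgrade headers
    proxy_set_header Connection "upgrade";
    # Must be longer than the app's ping interval (e.g. ~30s pings),
    # or idle WS connections get cut by the proxy between pings.
    proxy_read_timeout 75s;
    proxy_send_timeout 75s;
}
```

If the timeout is shorter than the ping cycle, Nginx itself becomes an unintended hard-cut layer for idle WebSocket connections.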
Common Mistakes
- Setting the drain timeout to a fixed 60 seconds and applying it to every service.
- Deciding the deployment window without WebSocket session lifetime data.
3. Danger points seen through architecture diagrams
Key takeaways
- The dangerous points are the boundaries: L4, Linux TCP state, Nginx upstream, and app shutdown.
- If any single layer does a hard stop, zero downtime fails.
Detailed description
+---------+ +-------------------+ +-----------+ +------------------+
| Client | ---> | L4 Load Balancer | ---> | VM/Nginx | ---> | App + WS/API |
+---------+ +-------------------+ +-----------+ +------------------+
| unbind? | close policy?
+------------------------------+
RST/timeout increase on hard cut
Practical tips
- Document the procedure by drawing the unbind point and the graceful shutdown point as separate timelines on the diagram.
Common Mistakes
- Estimating the cause from app logs alone, without a picture of the network layers.
Series navigation
- Previous post: None (this is the first post in the series)
- Next post: Part 2. L4 Load Balancer bind/unbind and connection lifecycle