
Part 1. Why does the service crash if you simply remove the server?

We analyze why hard cuts lead to failures in an L4 Load Balancer + Nginx + WebSocket architecture, from the connection lifecycle perspective.

Series: The Complete Graceful Drain Guide

A 7-part series. You are reading Part 1.


In an L4 Load Balancer -> VM -> Nginx -> Application -> WebSocket/HTTP API architecture, the first zero-downtime failure almost always begins with the act of "pulling the server out." This article explains, from an operational perspective, why hard cuts break Connection Draining, Graceful Shutdown, and Zero Downtime Deployment.

1. Failure pattern created by Hard Cut

Key takeaways

  • If you remove a VM without unbinding it first, "live TCP connections" vanish abruptly from the L4 perspective.
  • If a connection is cut mid-stream instead of ending cleanly with FIN, the client sees an RST or a timeout.
  • In a WebSocket load-balancing environment the damage is worse because connections are long-lived.

Detailed description

An L4 load balancer forwards traffic at the packet level and keeps no per-request application state. So if you remove the server immediately, the following happen at once:

  1. Established TCP sessions no longer receive server-side ACKs.
  2. Some connections are killed by a kernel-emitted RST or by a timeout in intermediate network devices.
  3. HTTP keepalive connections fail on the next request.
  4. WebSocket clients fall into a reconnection flood.

In other words, the failure is observed not at the "moment of removal" but concentrated in the "seconds to minutes immediately after removal."
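The reconnection flood in step 4 is usually tamed on the client side. A minimal sketch of exponential backoff with full jitter, assuming a hypothetical reconnect loop (the `reconnect_delays` helper and its parameters are illustrative, not from the article):

```python
import random

def reconnect_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Exponential backoff with full jitter: spreads client reconnects
    over time so a hard-cut server is not hit by a reconnection flood."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # Ceiling doubles each attempt but never exceeds `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: pick uniformly in [0, ceiling].
        delays.append(rng.uniform(0.0, ceiling))
    return delays
```

Clients that sleep for these delays between attempts spread the stampede out instead of all retrying in the same second.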

Practical tips

  • Before you start removing the server, always switch it to unbind + drain state on the LB.
  • First confirm on the operations dashboard that new connections have dropped to 0.
  • Automate the removal approval condition as "active connections below threshold."
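The approval condition in the last tip can be sketched as a polling gate, assuming your LB or metrics API exposes an active-connection count (the function name `get_active_connections` is a placeholder):

```python
import time

def wait_for_drain(get_active_connections, threshold=5, timeout=600.0, interval=5.0):
    """Block VM removal until active connections fall below a threshold.
    `get_active_connections` is whatever count your LB/metrics API exposes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_active_connections() <= threshold:
            return True  # safe to proceed with termination
        time.sleep(interval)
    return False  # drain deadline exceeded; escalate instead of hard-cutting
```

Wiring `False` to an alert, rather than to a forced terminate, is what keeps this from degenerating back into a hard cut.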

Common Mistakes

  • Wiring the Auto Scaling termination event straight to an immediate terminate.
  • Judging that "the service is healthy" from the VM health check alone.
  • Watching only RPS and ignoring connection metrics.

2. Actual flow from request/connection perspective

Key takeaways

  • Request-level success rate and connection-level stability are different metrics.
  • keepalive and WebSocket are the key variables that lengthen drain time.

Detailed description

Even if the HTTP API serves short-lived requests, the underlying connections are reused via keepalive. WebSockets are more extreme: a connection can stay open for hours, so draining barely makes progress.

Example:

  • active connections when drain starts = 1200
  • active connections after 30 minutes = 1197

The numbers barely decrease not because of an "inflow of new connections" but because of "existing long-lived connections." Connection Draining therefore has to be a policy about connection lifetime, not just latency.
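One way to derive a per-service drain deadline from data rather than a fixed timeout is to take a high percentile of observed session lifetimes. This helper is a sketch under that assumption; the name and the percentile choice are illustrative:

```python
def drain_deadline(session_lifetimes_s, target_fraction=0.99):
    """Pick a drain deadline (seconds) that lets `target_fraction` of
    observed sessions end naturally, instead of a fixed 60 s everywhere."""
    ordered = sorted(session_lifetimes_s)
    # Index of the target percentile, clamped to the last element.
    index = min(len(ordered) - 1, int(target_fraction * len(ordered)))
    return ordered[index]
```

For a WebSocket-heavy service with hour-long sessions, this immediately surfaces why a 60-second drain window cannot work.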

Practical tips

  • If API and WS traffic share the same port, at least split their metrics.
  • Design nginx's proxy_read_timeout, the application's ping/pong cycle, and the reconnect policy as one set.
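The "design as a set" tip can be enforced mechanically. A minimal sanity check, assuming the app's ping interval must fit well inside nginx's proxy_read_timeout idle window (the margin factor is an illustrative choice):

```python
def timeouts_consistent(proxy_read_timeout_s, ping_interval_s, margin=2.0):
    """nginx closes an idle upstream WS connection after proxy_read_timeout,
    so the app's ping/pong must fire well inside that window."""
    return ping_interval_s * margin <= proxy_read_timeout_s
```

Running a check like this in CI against the deployed nginx config catches the classic failure where someone shortens proxy_read_timeout without touching the ping cycle.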

Common Mistakes

  • Setting the drain timeout to a fixed 60 seconds and applying it to every service.
  • Deciding the deployment window without WebSocket session-lifetime data.

3. Danger points seen through architecture diagrams

Key takeaways

  • The dangerous points are the boundaries: L4, Linux TCP state, Nginx upstream, and app shutdown.
  • If any one layer hard-stops, zero-downtime fails.

Detailed description

+---------+      +-------------------+      +-----------+      +------------------+
| Client  | ---> | L4 Load Balancer  | ---> | VM/Nginx  | ---> | App + WS/API     |
+---------+      +-------------------+      +-----------+      +------------------+
                        |   unbind?                    | close policy?
                        +------------------------------+
                          RST/timeout spikes on hard cut

Practical tips

  • Document the operation with the unbind time and the graceful shutdown time drawn as separate timelines on the diagram.
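Those timelines can also be encoded as an ordered runbook and checked automatically. The layer names and steps below are an illustrative sketch, not the article's exact procedure; the point is that shutdown must proceed from the client-facing layer inward:

```python
# A drain runbook as ordered steps; a hard stop at any layer breaks the chain.
RUNBOOK = [
    ("l4", "unbind the VM from the L4 pool"),
    ("l4", "wait for active connections to fall below the threshold"),
    ("nginx", "signal a graceful worker shutdown"),
    ("app", "stop accepting new WS/API work, finish in-flight requests"),
    ("vm", "terminate the instance"),
]

def validate_order(runbook):
    """Each layer may begin shutting down only after the layer in front
    of it (closer to the client) has already started draining."""
    layer_rank = {"l4": 0, "nginx": 1, "app": 2, "vm": 3}
    ranks = [layer_rank[layer] for layer, _ in runbook]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))
```

A reversed or shuffled runbook fails the check, which is exactly the hard-cut pattern this article warns about.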

Common Mistakes

  • Estimating the cause from app logs alone, without a picture of the network layers.
