
Part 4. Why drain fails in a WebSocket environment

This part analyzes why connection draining never completes when WebSocket persistent connections are in play, from the TCP, Nginx proxy, and application perspectives, and presents five solution strategies.

Series: Graceful Drain Complete Guide

A 7-part series. You are currently reading Part 4.


In HTTP-centric systems, connection draining usually finishes quickly. In WebSocket load balancing, however, connections can last for hours and the drain effectively stalls. This article breaks down the reasons from the TCP, Nginx WebSocket proxy, and application-protocol perspectives.

Versions covered

  • Linux Kernel 5.15+
  • Nginx 1.25+
  • JVM 21
  • WebSocket RFC 6455 implementation (e.g. Spring WebSocket)

1. Why are active connections not decreasing?

Key takeaways

  • WebSocket is session-scoped, not request-scoped.
  • Even after drain blocks new connections, existing long-lived sessions linger for a long time.

Detailed description

Real observation example:

  • Start drain: active connection = 1200
  • After 30 minutes: active connection = 1197

Reasons for little decline:

  1. The client keeps the persistent connection alive.
  2. Unless the server sends a close frame first, the session keeps living.
  3. With a long Nginx proxy_read_timeout, the proxy never cuts the connection on its own.

In other words, even if the drain timeout is increased to 10 or 20 minutes, the fundamental problem is not solved. A connection lifetime control policy is needed.

Practical tips

  • Measure the average, p95, and p99 lifetime of WebSocket sessions and derive the drain timeout from that data.
  • You cannot design a drain policy without knowing the session-length distribution.
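As a sketch of the measurement tip above, the percentile calculation can be as simple as a nearest-rank computation over sampled session durations. The class and method names here are illustrative, not from any specific library:

```java
import java.util.Arrays;

// Sketch: derive p95/p99 session lifetimes from sampled durations (in seconds).
public class SessionStats {
    // Nearest-rank percentile over a sample of session durations.
    static long percentile(long[] durationsSec, double p) {
        long[] sorted = durationsSec.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] sample = {30, 120, 300, 600, 1800, 3600, 5400, 7200, 10800, 14400};
        System.out.println("p95=" + percentile(sample, 95) + "s"
                + ", p99=" + percentile(sample, 99) + "s");
    }
}
```

Feeding a day's worth of recorded session durations through something like this gives the p95/p99 figures the drain timeout should be based on.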

Common Mistakes

  • Applying the standard HTTP drain timeout unchanged to WebSocket.
  • Deploying without monitoring the rate at which active connections decrease.

2. Decomposition of failure causes: TCP, Nginx, App

Key takeaways

  • TCP layer: The connection is normal, it just lasts a long time.
  • Nginx layer: Upgraded connections behave differently from regular keepalives.
  • App layer: Without a server-driven close/reconnect protocol, drain does not end.

Detailed description

TCP perspective:

  • The session remains in the ESTABLISHED state.
  • Since nothing is failing, FIN/RST events never occur naturally.

Nginx perspective:

  • proxy_set_header Upgrade and Connection "upgrade" establish a tunnel.
  • With proxy_read_timeout at 3600 seconds, a connection survives even a full hour of idleness.
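A minimal Nginx location block showing this tunneling setup might look like the following; the upstream name and path are assumptions for illustration:

```nginx
location /ws/ {
    proxy_pass http://websocket_app;        # assumed upstream name
    proxy_http_version 1.1;                 # required for the Upgrade mechanism
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;               # idle sessions survive up to an hour
}
```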

Application perspective:

  • If the server never sends a drain notification or session-termination signal, the client simply stays connected.
  • Without a reconnection backoff policy, a reconnect storm hits when sessions are terminated.

Practical tips

  • Store session_start_at and ttl_at on each WebSocket session so the server can explicitly trigger reconnection.
  • During drain, broadcast a "reconnect after N seconds" signal to all sessions.

Common Mistakes

  • Assuming that ping/pong alone means sessions will drain automatically.
  • Letting each client implement its own reconnect logic, making operational control impossible.

3. Diagram: WebSocket persistent connection

Key takeaways

  • Drain failure occurs not because the connection is abnormal, but because it is “too normal.”

Detailed description

[Diagram: WebSocket drain bottleneck — Client -> L4 LB (draining) -> Nginx -> WebSocket App; active: 1200, 30m later: 1197]

new_conn: 0  (good)
active_conn: 1200 -> 1197 (30m)
reset_count: low

=> not an error spike
=> long-lived sessions preventing drain completion

Practical tips

  • Detect a stalled drain automatically by computing the drain-progress indicator delta(active_conn)/minute.
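A sketch of that indicator, assuming two periodic samples of the active-connection gauge; the class and method names are illustrative:

```java
// Sketch: flag a stalled drain from two samples of the active-connection count.
public class DrainProgress {
    // Connections closed per minute between two samples.
    static double declinePerMinute(long prevActive, long currActive, double minutesElapsed) {
        return (prevActive - currActive) / minutesElapsed;
    }

    // Stalled if the decline rate falls below the target rate.
    static boolean isStalled(long prevActive, long currActive,
                             double minutesElapsed, double targetPerMinute) {
        return declinePerMinute(prevActive, currActive, minutesElapsed) < targetPerMinute;
    }

    public static void main(String[] args) {
        // The article's observation: 1200 -> 1197 over 30 minutes = 0.1 conn/min.
        System.out.println(declinePerMinute(1200, 1197, 30));
    }
}
```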

Common Mistakes

  • Because there is no RST, the node is judged healthy and the deployment window is extended indefinitely.

4. Five solution strategies

Key takeaways

  • The solution is not to extend the timeout, but to “intentionally end the session.”

Detailed description

Strategy 1. Application graceful shutdown

  • Reject new WS handshake when app receives SIGTERM
  • Send a close code (e.g. 1001 Going Away) to existing sessions, then wait out a grace period

Strategy 2. Nginx idle timeout adjustment

  • An overly long proxy_read_timeout delays drain completion.
  • Gradually reduce according to traffic characteristics (e.g. 3600 -> 900)
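A hedged example of the stepped-down setting; the values must be tuned to your own traffic:

```nginx
# was: proxy_read_timeout 3600s;  -- hour-long idle tunnels stall the drain
proxy_read_timeout 900s;   # step down gradually, e.g. 3600 -> 1800 -> 900
proxy_send_timeout 900s;   # keep the write-side timeout consistent
```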

Strategy 3. Limit connection lifetime

  • Introduction of session maximum lifetime (TTL)
  • Induce reconnection when TTL is reached
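The TTL check can be sketched like this, assuming a two-hour maximum-lifetime policy; the class name and threshold are illustrative, not a Spring API:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: decide whether a session has exceeded its maximum lifetime (TTL).
public class SessionTtl {
    static final Duration MAX_LIFETIME = Duration.ofHours(2); // assumed policy

    // True when the session started more than MAX_LIFETIME ago;
    // such sessions should be told to reconnect.
    static boolean expired(Instant sessionStartAt, Instant now) {
        return Duration.between(sessionStartAt, now).compareTo(MAX_LIFETIME) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(expired(now.minus(Duration.ofHours(3)), now));
        System.out.println(expired(now.minus(Duration.ofMinutes(30)), now));
    }
}
```

A periodic sweep over live sessions with a check like this guarantees that no session outlives the drain window, regardless of client behavior.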

Strategy 4. Induce application reconnection

  • Distributed reconnection based on server notification (backoff + jitter)
  • Apply random delayed reconnection at drain time
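One common shape for distributed reconnection is exponential backoff with full jitter; the base, cap, and class name below are assumptions for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential backoff with full jitter for reconnect scheduling.
public class ReconnectBackoff {
    static final long BASE_MS = 1_000;   // first retry ceiling around 1s (assumed)
    static final long CAP_MS = 60_000;   // never wait more than 60s (assumed)

    // Delay before the attempt-th reconnect (attempt starts at 0):
    // a uniform random value in [0, min(cap, base * 2^attempt)].
    static long delayMs(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 16));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println("attempt " + i + " -> wait " + delayMs(i) + "ms");
        }
    }
}
```

At drain time, the server-broadcast retryAfterSec value can be added on top of this jitter so clients do not stampede back at once.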

Strategy 5. Drain notification

  • Broadcast “drain notice” to the entire session at the start of operation
  • Provides advance warning and grace window to minimize user impact

Practical tips

  • Do not turn on all five strategies at the same time, but apply them gradually in a measurable order.
  • The usual priority is app graceful shutdown -> reconnect policy -> timeout tuning.

Common Mistakes

  • Reconnecting clients immediately without jitter, overloading the LB all over again.

// WebSocket graceful shutdown example (conceptual code)
public void onDrainStart() {
    acceptingNewSessions.set(false);               // stop accepting new WS handshakes
    sessions.forEach(session -> {
        try {
            // warn the client before closing so it can schedule a delayed reconnect
            session.sendMessage(new TextMessage(
                "{\"type\":\"DRAIN_NOTICE\",\"retryAfterSec\":15}"));
            session.close(CloseStatus.GOING_AWAY); // close code 1001
        } catch (IOException e) {
            // the session may already be gone; safe to ignore during drain
        }
    });
}

Operational Checklist

  • Are new WS handshakes rejected immediately when drain starts?
  • Is the active-connection decline rate at or above the target value?
  • Do clients apply backoff + jitter to prevent reconnect congestion?
  • Does proxy_read_timeout match the service's traffic characteristics?
  • Does the drain notification work without user-visible impact?

Summary

The essence of WebSocket drain failure is not an error, but session lifetime. Zero Downtime Deployment is possible only when Connection Draining and Graceful Shutdown are combined, and the server and client design session termination together.

Next episode preview

In the next part, the actual non-disruptive deployment procedure (7 steps) and configuration examples in the VM + Nginx environment are summarized at the operational runbook level.
