
Part 4. Why drain fails in a WebSocket environment

This part analyzes why connection draining never completes when WebSocket persistent connections are in play, from the TCP, Nginx proxy, and application perspectives, and presents five solution strategies.

Series: Graceful Drain Complete Guide

A 7-part series. You are currently reading Part 4.


In HTTP-centric systems, connection draining usually finishes quickly. In WebSocket load balancing, however, connections can last for hours and the drain effectively stalls. This article breaks down the reasons from the TCP, Nginx WebSocket proxy, and application-protocol perspectives.

Versions covered

  • Linux Kernel 5.15+
  • Nginx 1.25+
  • JVM 21
  • WebSocket RFC 6455 implementation (e.g. Spring WebSocket)

1. Why are active connections not decreasing?

Key takeaways

  • WebSocket is session-scoped, not request-scoped.
  • Even after drain blocks new connections, existing long-lived sessions linger for a long time.

Detailed description

Real observation example:

  • Start drain: active connection = 1200
  • After 30 minutes: active connection = 1197

Reasons for little decline:

  1. The client keeps the persistent connection alive.
  2. Unless the server sends a close frame first, the session keeps living.
  3. With a long Nginx proxy_read_timeout, the proxy never cuts the connection on its own.

In other words, even if the drain timeout is increased to 10 or 20 minutes, the fundamental problem is not solved. A connection lifetime control policy is needed.

Practical tips

  • Measure the average, p95, and p99 lifetime of WebSocket sessions and derive the drain timeout from that data.
  • You cannot design a drain policy without knowing the session-length distribution.
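As a sketch of the measurement tip above, the percentile calculation can be as simple as a nearest-rank computation over sampled session durations. The class and method names here are illustrative, not from any specific library:

```java
import java.util.Arrays;

// Sketch: derive p95/p99 session lifetimes from sampled durations (in seconds).
public class SessionStats {
    // Nearest-rank percentile over a sample of session durations.
    static long percentile(long[] durationsSec, double p) {
        long[] sorted = durationsSec.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] sample = {30, 120, 300, 600, 1800, 3600, 5400, 7200, 10800, 14400};
        System.out.println("p95=" + percentile(sample, 95) + "s"
                + ", p99=" + percentile(sample, 99) + "s");
    }
}
```

Feeding a day's worth of recorded session durations through something like this gives the p95/p99 figures the drain timeout should be based on.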

Common Mistakes

  • Applying the standard HTTP drain timeout unchanged to WebSocket.
  • Deploying without monitoring the rate at which active connections decrease.

2. Decomposition of failure causes: TCP, Nginx, App

Key takeaways

  • TCP layer: The connection is normal, it just lasts a long time.
  • Nginx layer: Upgraded connections behave differently from regular keepalives.
  • App layer: Without a server-driven close/reconnect protocol, drain does not end.

Detailed description

TCP perspective:

  • The session remains in the ESTABLISHED state.
  • Since nothing is failing, FIN/RST events never occur naturally.

Nginx perspective:

  • proxy_set_header Upgrade and Connection "upgrade" establish a tunnel.
  • With proxy_read_timeout at 3600 seconds, a connection survives even a full hour of idleness.
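A minimal Nginx location block showing this tunneling setup might look like the following; the upstream name and path are assumptions for illustration:

```nginx
location /ws/ {
    proxy_pass http://websocket_app;        # assumed upstream name
    proxy_http_version 1.1;                 # required for the Upgrade mechanism
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600s;               # idle sessions survive up to an hour
}
```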

Application perspective:

  • If the server never sends a drain notification or session-termination signal, the client simply stays connected.
  • Without a reconnection backoff policy, a reconnect storm hits when sessions are terminated.

Practical tips

  • Store session_start_at and ttl_at on each WebSocket session so the server can explicitly trigger reconnection.
  • During drain, broadcast a "reconnect after N seconds" signal to all sessions.

Common Mistakes

  • Assuming that ping/pong alone means sessions will drain automatically.
  • Letting each client implement its own reconnect logic, making operational control impossible.

3. Diagram: WebSocket persistent connection

Key takeaways

  • Drain failure occurs not because the connection is abnormal, but because it is “too normal.”

Detailed description

[Diagram: WebSocket drain bottleneck — Client -> L4 LB (draining) -> Nginx -> WebSocket App; active: 1200, 30m later: 1197]

new_conn: 0  (good)
active_conn: 1200 -> 1197 (30m)
reset_count: low

=> not an error spike
=> long-lived sessions preventing drain completion

Practical tips

  • Detect a stalled drain automatically by computing the drain-progress indicator delta(active_conn)/minute.
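A sketch of that indicator, assuming two periodic samples of the active-connection gauge; the class and method names are illustrative:

```java
// Sketch: flag a stalled drain from two samples of the active-connection count.
public class DrainProgress {
    // Connections closed per minute between two samples.
    static double declinePerMinute(long prevActive, long currActive, double minutesElapsed) {
        return (prevActive - currActive) / minutesElapsed;
    }

    // Stalled if the decline rate falls below the target rate.
    static boolean isStalled(long prevActive, long currActive,
                             double minutesElapsed, double targetPerMinute) {
        return declinePerMinute(prevActive, currActive, minutesElapsed) < targetPerMinute;
    }

    public static void main(String[] args) {
        // The article's observation: 1200 -> 1197 over 30 minutes = 0.1 conn/min.
        System.out.println(declinePerMinute(1200, 1197, 30));
    }
}
```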

Common Mistakes

  • Because there is no RST, the node is judged healthy and the deployment window is extended indefinitely.

4. Five solution strategies

Key takeaways

  • The solution is not to extend the timeout, but to “intentionally end the session.”

Detailed description

Strategy 1. Application graceful shutdown

  • Reject new WS handshake when app receives SIGTERM
  • Send a close code (e.g. 1001 Going Away) to existing sessions, then wait out a grace period

Strategy 2. Nginx idle timeout adjustment

  • An overly long proxy_read_timeout delays drain completion.
  • Gradually reduce according to traffic characteristics (e.g. 3600 -> 900)
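A hedged example of the stepped-down setting; the values must be tuned to your own traffic:

```nginx
# was: proxy_read_timeout 3600s;  -- hour-long idle tunnels stall the drain
proxy_read_timeout 900s;   # step down gradually, e.g. 3600 -> 1800 -> 900
proxy_send_timeout 900s;   # keep the write-side timeout consistent
```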

Strategy 3. Limit connection lifetime

  • Introduction of session maximum lifetime (TTL)
  • Induce reconnection when TTL is reached
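The TTL check can be sketched like this, assuming a two-hour maximum-lifetime policy; the class name and threshold are illustrative, not a Spring API:

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: decide whether a session has exceeded its maximum lifetime (TTL).
public class SessionTtl {
    static final Duration MAX_LIFETIME = Duration.ofHours(2); // assumed policy

    // True when the session started more than MAX_LIFETIME ago;
    // such sessions should be told to reconnect.
    static boolean expired(Instant sessionStartAt, Instant now) {
        return Duration.between(sessionStartAt, now).compareTo(MAX_LIFETIME) > 0;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println(expired(now.minus(Duration.ofHours(3)), now));
        System.out.println(expired(now.minus(Duration.ofMinutes(30)), now));
    }
}
```

A periodic sweep over live sessions with a check like this guarantees that no session outlives the drain window, regardless of client behavior.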

Strategy 4. Induce application reconnection

  • Distributed reconnection based on server notification (backoff + jitter)
  • Apply random delayed reconnection at drain time
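One common shape for distributed reconnection is exponential backoff with full jitter; the base, cap, and class name below are assumptions for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: exponential backoff with full jitter for reconnect scheduling.
public class ReconnectBackoff {
    static final long BASE_MS = 1_000;   // first retry ceiling around 1s (assumed)
    static final long CAP_MS = 60_000;   // never wait more than 60s (assumed)

    // Delay before the attempt-th reconnect (attempt starts at 0):
    // a uniform random value in [0, min(cap, base * 2^attempt)].
    static long delayMs(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 16));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 5; i++) {
            System.out.println("attempt " + i + " -> wait " + delayMs(i) + "ms");
        }
    }
}
```

At drain time, the server-broadcast retryAfterSec value can be added on top of this jitter so clients do not stampede back at once.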

Strategy 5. Drain notification

  • Broadcast “drain notice” to the entire session at the start of operation
  • Provides advance warning and grace window to minimize user impact

Practical tips

  • Do not turn on all five strategies at the same time, but apply them gradually in a measurable order.
  • The usual priority is app graceful shutdown -> reconnect policy -> timeout tuning.

Common Mistakes

  • Reconnecting clients immediately without jitter, overloading the LB all over again.

// WebSocket graceful shutdown example (conceptual code)
public void onDrainStart() {
    acceptingNewSessions.set(false);               // stop accepting new WS handshakes
    sessions.forEach(session -> {
        try {
            // warn the client before closing so it can schedule a delayed reconnect
            session.sendMessage(new TextMessage(
                "{\"type\":\"DRAIN_NOTICE\",\"retryAfterSec\":15}"));
            session.close(CloseStatus.GOING_AWAY); // close code 1001
        } catch (IOException e) {
            // the session may already be gone; safe to ignore during drain
        }
    });
}

Operational Checklist

  • Are new WS handshakes rejected immediately when drain starts?
  • Is the active-connection decline rate at or above the target value?
  • Do clients apply backoff + jitter to prevent reconnect congestion?
  • Does proxy_read_timeout match the service's traffic characteristics?
  • Does the drain notification work without user-visible impact?

Summary

The essence of WebSocket drain failure is not an error, but session lifetime. Zero Downtime Deployment is possible only when Connection Draining and Graceful Shutdown are combined, and the server and client design session termination together.

Next episode preview

In the next part, the actual non-disruptive deployment procedure (7 steps) and configuration examples in the VM + Nginx environment are summarized at the operational runbook level.
