Part 4. Why drain fails in a WebSocket environment
This part analyzes why connection draining fails to complete when WebSocket persistent connections are involved, from the TCP, Nginx proxy, and application perspectives, and presents five solution strategies.
Series: The Complete Guide to Graceful Drain
7 parts in total. You are reading Part 4.
- Part 1. Why does the service crash if you simply remove the server?
- Part 2. L4 load balancer bind/unbind and the connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment (current)
- Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operational checklist and verification methods
In HTTP-centric systems, connection draining usually completes quickly. With WebSocket load balancing, however, connections can last for hours and the drain effectively stalls. This article breaks down the reasons from the TCP, Nginx WebSocket proxy, and application protocol perspectives.
Versions assumed
- Linux Kernel 5.15+
- Nginx 1.25+
- JVM 21
- WebSocket RFC 6455 implementation (e.g. Spring WebSocket)
1. Why are active connections not decreasing?
Key takeaways
- WebSocket is a session unit, not a request unit.
- Even if new connections are blocked after drain, existing long-lived sessions remain for a long time.
Detailed description
Real observation example:
- Start drain: active connection = 1200
- After 30 minutes: active connection = 1197
Reasons for the slow decline:
- The client maintains a persistent connection.
- If the server does not send a close frame first, the session stays alive.
- If Nginx's `proxy_read_timeout` is long, the connection is not cut off mid-stream.
In other words, even if the drain timeout is increased to 10 or 20 minutes, the fundamental problem is not solved. A connection lifetime control policy is needed.
Practical tips
- Measure the average and 95th/99th percentile lifetimes of WebSocket sessions, then set the drain timeout accordingly.
- You cannot design a drain policy without knowing the session length distribution.
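As a sketch of the first tip, a drain timeout can be derived from measured session lifetimes. The nearest-rank percentile method and the sample durations below are illustrative assumptions, not a prescribed approach:

```java
import java.util.Arrays;

// Sketch: derive a drain-timeout target from observed WebSocket session lifetimes.
// The sample durations below are illustrative, not real measurements.
public class SessionLifetimeStats {

    // Returns the p-th percentile (0-100) of sorted values, nearest-rank method.
    static long percentile(long[] sortedSeconds, double p) {
        int rank = (int) Math.ceil(p / 100.0 * sortedSeconds.length);
        return sortedSeconds[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        // Session lifetimes in seconds, collected over one deploy window (assumed data).
        long[] lifetimes = {30, 45, 60, 120, 300, 600, 1800, 3600, 5400, 7200};
        Arrays.sort(lifetimes);
        System.out.println("p50=" + percentile(lifetimes, 50) + "s");
        System.out.println("p95=" + percentile(lifetimes, 95) + "s");
        // A drain timeout far below p95 means most sessions must be
        // terminated by the server rather than waited out.
    }
}
```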
Common Mistakes
- Applying the standard HTTP drain timeout unchanged to WebSocket traffic.
- Proceeding with deployment without checking the rate of decrease in active connections.
2. Decomposition of failure causes: TCP, Nginx, App
Key takeaways
- TCP layer: The connection is normal, it just lasts a long time.
- Nginx layer: Upgraded connections behave differently from regular keepalives.
- App layer: Without a server-driven close/reconnect protocol, drain does not end.
Detailed description
TCP perspective:
- The session stays in the `ESTABLISHED` state.
- Since there are no errors, FIN/RST events do not occur naturally.
Nginx perspective:
- A tunnel is formed with `proxy_set_header Upgrade` and `Connection "upgrade"`.
- If `proxy_read_timeout` is 3600 seconds, the connection can stay open even after an hour of idleness.
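A hedged Nginx sketch of the settings above; the `/ws/` location and `app_backend` upstream are placeholder names, and 900s is an example value, not a recommendation:

```nginx
# Illustrative WebSocket proxy block; names and values are placeholders.
location /ws/ {
    proxy_pass http://app_backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    # A long read timeout keeps idle tunnels alive and stalls drain;
    # shorten it to match the service's real traffic pattern.
    proxy_read_timeout 900s;
    proxy_send_timeout 900s;
}
```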
Application perspective:
- If the server does not send a drain notification or session termination signal, the client will remain stuck.
- If there is no reconnection backoff policy, a reconnect storm occurs at termination.
Practical tips
- The server records `session_start_at` and `ttl_at` on the WebSocket session and uses them to explicitly trigger reconnection.
- During drain, a "reconnect after N seconds" signal is broadcast to all sessions.
Common Mistakes
- Assuming that ping/pong alone will make connections drain automatically.
- Implementing reconnect logic separately per client, which makes operational control impossible.
3. Diagram: WebSocket persistent connection
Key takeaways
- Drain failure occurs not because the connection is abnormal, but because it is “too normal.”
Detailed description
```
WebSocket drain bottleneck
  new_conn:    0 (good)
  active_conn: 1200 -> 1197 (30m)
  reset_count: low
  => not an error spike
  => long-lived sessions preventing drain completion
```
Practical tips
- Compute the drain progress indicator as `delta(active_conn)/minute` so that this kind of silent stall is detected automatically.
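The progress indicator above can be sketched as follows; the 5-connections-per-minute threshold and the sample series are assumptions chosen for illustration:

```java
// Sketch: flag a stalled drain from periodic active-connection samples.
// Threshold and sample data are illustrative assumptions.
public class DrainProgress {

    // Samples taken one minute apart; returns true when the average
    // decline per minute falls below the required rate.
    static boolean isStalled(int[] perMinuteSamples, double requiredDeclinePerMin) {
        int minutes = perMinuteSamples.length - 1;
        double decline =
            (perMinuteSamples[0] - perMinuteSamples[minutes]) / (double) minutes;
        return decline < requiredDeclinePerMin;
    }

    public static void main(String[] args) {
        // 1200 -> 1197 over 30 minutes: 0.1 conns/min, clearly stalled.
        int[] samples = new int[31];
        for (int i = 0; i <= 30; i++) samples[i] = 1200 - i / 10;
        System.out.println(isStalled(samples, 5.0)); // prints "true"
    }
}
```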
Common Mistakes
- Judging the drain to be healthy because there are no RSTs, and extending the deployment window indefinitely.
4. 5 solution strategies
Key takeaways
- The solution is not to extend the timeout, but to “intentionally end the session.”
Detailed description
Strategy 1. Application graceful shutdown
- Reject new WS handshake when app receives SIGTERM
- Wait for grace period after sending close code (e.g. 1001) to existing session
Strategy 2. Nginx idle timeout adjustment
- An overly long `proxy_read_timeout` delays drain completion.
- Gradually reduce it according to traffic characteristics (e.g. 3600 -> 900).
Strategy 3. Limit connection lifetime
- Introduction of session maximum lifetime (TTL)
- Induce reconnection when TTL is reached
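A minimal sketch of the TTL check for Strategy 3, assuming a 2-hour maximum lifetime (an illustrative value, to be tuned against the measured session length distribution):

```java
import java.time.Duration;
import java.time.Instant;

// Sketch: decide whether a session has exceeded its maximum lifetime (TTL).
// The 2-hour TTL is an assumed value, not a recommendation.
public class SessionTtl {

    static final Duration MAX_LIFETIME = Duration.ofHours(2);

    // True when the session is old enough that the server should
    // close it and let the client reconnect.
    static boolean shouldRecycle(Instant sessionStartAt, Instant now) {
        return Duration.between(sessionStartAt, now).compareTo(MAX_LIFETIME) >= 0;
    }

    public static void main(String[] args) {
        Instant start = Instant.parse("2024-01-01T00:00:00Z");
        System.out.println(shouldRecycle(start, start.plus(Duration.ofHours(3))));   // prints "true"
        System.out.println(shouldRecycle(start, start.plus(Duration.ofMinutes(30)))); // prints "false"
    }
}
```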
Strategy 4. Induce application reconnection
- Distributed reconnection based on server notification (backoff + jitter)
- Apply random delayed reconnection at drain time
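One way to sketch the client-side delay for Strategy 4; the base delay, the cap, and the "full jitter" variant are assumptions, not a prescribed policy:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch: client-side reconnect delay with exponential backoff and full jitter,
// so a drained fleet does not reconnect in one synchronized burst.
public class ReconnectBackoff {

    static final long BASE_MS = 1_000;   // first retry around 1s (assumed)
    static final long CAP_MS  = 60_000;  // never wait more than 60s (assumed)

    // attempt = 0, 1, 2, ...; returns a delay in [0, min(cap, base * 2^attempt))
    static long nextDelayMs(int attempt) {
        long ceiling = Math.min(CAP_MS, BASE_MS << Math.min(attempt, 16));
        return ThreadLocalRandom.current().nextLong(ceiling);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 5; attempt++) {
            System.out.println("attempt " + attempt + " -> wait "
                + nextDelayMs(attempt) + "ms");
        }
    }
}
```

Because every client draws its delay independently, reconnections spread across the window instead of hitting the LB at the same instant.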
Strategy 5. Drain notification
- Broadcast “drain notice” to the entire session at the start of operation
- Provides advance warning and grace window to minimize user impact
Practical tips
- Do not turn on all five strategies at the same time, but apply them gradually in a measurable order.
- The usual priority is: app graceful shutdown -> reconnect policy -> timeout tuning.
Common Mistakes
- Reconnecting clients immediately without jitter, which overloads the LB all over again.
```java
// Example WebSocket graceful shutdown (conceptual, Spring WebSocket API)
public void onDrainStart() {
    // Stop accepting new WS handshakes first.
    acceptingNewSessions.set(false);
    sessions.forEach(session -> {
        try {
            // Give clients a reconnect hint, then close with 1001 (Going Away).
            session.sendMessage(new TextMessage(
                "{\"type\":\"DRAIN_NOTICE\",\"retryAfterSec\":15}"));
            session.close(CloseStatus.GOING_AWAY);
        } catch (IOException e) {
            // The peer may already be gone; draining continues regardless.
        }
    });
}
```
Operational Checklist
- Are new WS handshakes blocked immediately when drain starts?
- Is the active-connection decline rate at or above the target value?
- Do clients apply backoff + jitter to prevent reconnect congestion?
- Does `proxy_read_timeout` match the service's characteristics?
- Does the drain notification work without impacting users?
Summary
The essence of WebSocket drain failure is not an error, but session lifetime. Zero Downtime Deployment is possible only when Connection Draining and Graceful Shutdown are combined, and the server and client design session termination together.
Next episode preview
In the next part, the actual non-disruptive deployment procedure (7 steps) and configuration examples in the VM + Nginx environment are summarized at the operational runbook level.
Reference link
- RFC 6455: The WebSocket Protocol
- Nginx WebSocket Proxying
- Google SRE Book - Handling Overload
- Blog: Queue Backpressure Patterns
Series navigation
- Previous post: Part 3. Internal operation of Connection Draining (TCP perspective)
- Next post: Part 5. Non-disruptive deployment strategy in a VM + Nginx environment