Part 3. Internal operation of Connection Draining (TCP perspective)
We break Connection Draining down into the TCP lifecycle and explain how SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT show up in real operational indicators.
Series: The Complete Guide to Graceful Drain
7 parts in total. You are viewing Part 3.
- 01. Part 1. Why does the service crash if you simply remove a server?
- 02. Part 2. L4 Load Balancer bind/unbind and the connection lifecycle
- 03. Part 3. Internal operation of Connection Draining (TCP perspective) (current)
- 04. Part 4. Why drain fails in a WebSocket environment
- 05. Part 5. Zero-downtime deployment strategy in a VM + Nginx environment
- 06. Part 6. Analysis of real failure cases (SRE perspective)
- 07. Part 7. Operations checklist and verification methods
To really understand Connection Draining, you need to look at TCP state transitions before the HTTP logs. When drain fails beneath an L4 Load Balancer, it is usually because the distribution of SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT states is off.
Versions this article is based on
- Linux Kernel 5.15+
- Nginx 1.25+
- tcpdump 4.99+
1. Correlation between TCP lifecycle and drain
Key takeaways
- Drain is the process of blocking new SYN packets and waiting for existing ESTABLISHED connections to terminate naturally (FIN).
- If the RST ratio increases, the close is closer to hard than graceful.
Detailed description
Changes expected from normal drain:
- SYN_RECV: decreases quickly
- ESTABLISHED: decreases slowly
- FIN_WAIT*, TIME_WAIT: increase temporarily
- CLOSE_WAIT: stays low
Abnormal pattern:
- CLOSE_WAIT accumulation: the application closes sockets late.
- RST spike: the process was force-killed or the network was forcibly disconnected.
- TIME_WAIT explosion: too many short-lived connections plus an unbalanced keepalive policy.
Practical tips
- Automatically save ss -ant snapshots at three points: before, during, and after drain.
- Make exceeding the CLOSE_WAIT threshold a condition for aborting the deployment.
Common Mistakes
- Judging any increase in TIME_WAIT as a failure; a temporary increase is expected during drain.
- Mistaking CLOSE_WAIT for a kernel problem instead of checking the application's close logic.
# Count TCP sockets by state
ss -ant | awk '{print $1}' | sort | uniq -c
# Track states on a specific port (443) during drain
watch -n 2 "ss -ant '( sport = :443 )' | awk '{print \$1}' | sort | uniq -c"
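Building on the practical tip above, the CLOSE_WAIT abort condition can be sketched as a small gate script. This is a minimal sketch, not the article's tooling; the port (443) and threshold (50) are illustrative values.

```shell
#!/usr/bin/env bash
# Sketch: abort a deployment step when CLOSE_WAIT exceeds a threshold.
# PORT and THRESHOLD defaults are illustrative, not from the article.
set -u

PORT="${PORT:-443}"
THRESHOLD="${THRESHOLD:-50}"

# Count CLOSE-WAIT sockets in `ss -ant` output read from stdin.
# (ss prints the state with a hyphen: CLOSE-WAIT.)
count_close_wait() {
  awk '$1 == "CLOSE-WAIT" { n++ } END { print n + 0 }'
}

close_wait_gate() {
  local n
  n=$(ss -ant "( sport = :${PORT} )" | count_close_wait)
  if [ "$n" -gt "$THRESHOLD" ]; then
    echo "ABORT: CLOSE_WAIT=${n} exceeds threshold ${THRESHOLD}" >&2
    return 1
  fi
  echo "OK: CLOSE_WAIT=${n}"
}

# Demo run only where ss is available.
if command -v ss >/dev/null; then
  close_wait_gate || echo "gate failed (would abort the deployment)"
fi
```

A deployment pipeline would call this gate between unbind and process termination and stop the rollout on a non-zero exit.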
2. Drain state machine design
Key takeaways
- Drain must be managed with a state machine to be reproducible.
- State transitions must be indicator-based; decisions made on timeouts alone have a high probability of failure.
Detailed description
Recommended state transitions:
1. BOUND: Normal service
2. DRAIN_REQUESTED: unbind called
3. DRAINING: new=0, observe decrease in activity
4. GRACEFUL_STOP: Start App graceful shutdown
5. TERMINATED: Terminate process, replace VM
6. REBOUND: new version bind
The conditions for each state must be defined numerically.
- DRAINING -> GRACEFUL_STOP: active connections below the threshold (e.g. 20)
- GRACEFUL_STOP -> TERMINATED: FIN completed within the shutdown wait time
- On timeout, notify and get approval before forced termination
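The numeric conditions above can be sketched as an indicator-driven transition check. This is an assumed implementation, not the article's tooling; the threshold of 20 matches the example above, and the port (443) in the indicator function is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: decide the DRAINING -> GRACEFUL_STOP transition from live
# indicators rather than a fixed timeout. Threshold (20) from the text.
set -u

ACTIVE_THRESHOLD="${ACTIVE_THRESHOLD:-20}"

# next_state STATE ACTIVE_COUNT: print the next drain state.
next_state() {
  local state="$1" active="$2"
  case "$state" in
    DRAINING)
      if [ "$active" -le "$ACTIVE_THRESHOLD" ]; then
        echo "GRACEFUL_STOP"
      else
        echo "DRAINING"   # condition not met: stay and keep observing
      fi
      ;;
    *)
      echo "$state"
      ;;
  esac
}

# Illustrative indicator source: count ESTABLISHED sockets on the port.
active_conn() {
  ss -nt state established "( sport = :443 )" | tail -n +2 | wc -l
}
```

A control loop would call `next_state DRAINING "$(active_conn)"` on each tick and only advance when the indicator condition holds.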
Practical tips
- Record the drain state centrally in the deployment system (e.g. Argo/Jenkins).
- “Hold + operator confirmation” is often safer than automatic rollback when state transition fails.
Common Mistakes
- The drain state is only written to logs and never tied to metrics.
- There is no fencing token on state transitions, so duplicate operations occur.
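A real fencing token comes from a coordination service (e.g. etcd or ZooKeeper); as a minimal single-host stand-in for preventing duplicate drain runs, flock(1) works. This is an assumed sketch, not the article's setup; the lock file path is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: guard a drain step against duplicate concurrent runs with
# flock(1). The lock file path is illustrative.
set -u

LOCKFILE="${LOCKFILE:-/tmp/drain-vm-a.lock}"

run_drain_step() {
  # Non-blocking lock: a second concurrent invocation fails fast
  # instead of racing the first one through the state machine.
  flock -n 9 || { echo "drain already in progress" >&2; return 1; }
  echo "drain step running under lock"
} 9>"$LOCKFILE"

run_drain_step
```

A second `run_drain_step` launched while the first still holds the lock prints "drain already in progress" and returns non-zero, so the duplicate operation never touches the state machine.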
3. Diagram: TCP state and drain state machine
Key takeaways
- TCP state flow and operational state flow must be separated and visualized to speed up cause tracking.
Detailed description
Drain State Machine
BOUND
-> DRAIN_REQUESTED (unbind)
-> DRAINING (new_conn=0)
-> GRACEFUL_STOP (active_conn <= threshold)
-> TERMINATED
-> REBOUND
Abort path:
DRAINING -> FORCE_CLOSE -> RST spike
Practical tips
- If you overlay TCP states and drain states on the same time axis in one dashboard, the cause of a failure shows up immediately.
Common Mistakes
- Drawing conclusions from a single point-in-time capture and missing the timing of state transitions.
4. Operational scenario: Deployment with increased RST instead of FIN
Key takeaways
- Symptoms: at every deployment, clients reconnect in bursts and the reset count spikes.
- Essence: the graceful path is not taken and shutdown falls through to the forced path.
Detailed description
Observation log:
[lb] backend=vm-a reset_out=182/s new_conn=0 active_conn=214
[nginx] worker process exited on signal 9
[app] Shutdown hook not completed before SIGKILL
Cause:
- systemd stop timeout (30s) was shorter than app graceful timeout (60s).
- Nginx/App shutdown order reversed
Resolution:
- Set the app graceful timeout to 45s and the systemd stop timeout to 90s
- Shutdown sequence: app graceful -> nginx quit -> vm terminate
- Add warning notification before forced shutdown in case of drain failure
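A minimal sketch of the corrected timeout relationship, assuming the app runs as a systemd service. The unit name and environment variable are hypothetical; the 45s/90s values come from the resolution above.

```ini
# /etc/systemd/system/app.service (fragment, hypothetical unit name)
[Service]
# systemd sends KillSignal, waits TimeoutStopSec, then sends SIGKILL.
# TimeoutStopSec (90s) must exceed the app's graceful timeout (45s),
# otherwise the shutdown hook is killed mid-flight, as in the log above.
KillSignal=SIGTERM
TimeoutStopSec=90
# Hypothetical variable the app reads for its graceful shutdown window.
Environment=GRACEFUL_TIMEOUT_SECONDS=45
```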
Practical tips
- Add an "RST rate < threshold" verification step to your deployment script.
Common Mistakes
- It only increases the drain time and does not change the order of termination signals.
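The "RST rate < threshold" check from the practical tip above can be sketched by sampling the kernel's cumulative OutRsts counter in /proc/net/snmp. This is an assumed sketch; the interval and threshold values are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: an "RST rate below threshold" gate for a deployment script,
# based on the OutRsts counter in /proc/net/snmp.
set -u

# Cumulative outgoing RST count: find the OutRsts column in the Tcp
# header line, then print that column from the Tcp data line.
out_rsts() {
  awk '/^Tcp:/ {
    if ($2 == "RtoAlgorithm") {
      for (i = 2; i <= NF; i++) if ($i == "OutRsts") col = i
    } else if (col) {
      print $col
    }
  }' /proc/net/snmp
}

# check_rst_rate INTERVAL THRESHOLD: sample OutRsts twice and fail if
# the per-second delta exceeds the threshold.
check_rst_rate() {
  local interval="$1" threshold="$2" before after rate
  before=$(out_rsts)
  sleep "$interval"
  after=$(out_rsts)
  rate=$(( (after - before) / interval ))
  if [ "$rate" -gt "$threshold" ]; then
    echo "FAIL: RST rate ${rate}/s exceeds ${threshold}/s" >&2
    return 1
  fi
  echo "PASS: RST rate ${rate}/s"
}
```

In a deployment script this would run as, say, `check_rst_rate 5 10` right after unbind, aborting the rollout on failure instead of only extending the drain time.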
Operational Checklist
- Are drain state machines and timeout policies documented?
- Are the TIME_WAIT, CLOSE_WAIT, and RST thresholds defined?
- Is the termination signal order (SIGTERM/SIGQUIT/SIGKILL) specified?
- Does the deployment automation detect and abort state transition failures?
Summary
Connection draining is a state transition issue, not a time issue. Graceful Shutdown can be reproduced only by observing the TCP lifecycle and operating the drain state machine based on indicators.
Next episode preview
In the next part, we will cover why drain does not end in the WebSocket Load Balancing environment and five practical strategies for controlling long-lived connections.
Reference link
- RFC 9293: TCP States and Events
- Linux ss man page
- Nginx: Controlling nginx
- Blog: Disaster Recovery RTO/RPO Guide
Series navigation
- Previous post: Part 2. L4 Load Balancer bind/unbind and the connection lifecycle
- Next post: Part 4. Why drain fails in a WebSocket environment