
Part 3. Internal operation of Connection Draining (TCP perspective)

We decompose connection draining into the TCP lifecycle and explain how SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT appear in real operational indicators.

Series: The Complete Guide to Graceful Drain

The series has 7 parts. You are currently reading Part 3.


To really understand connection draining, you need to look at TCP state transitions before the HTTP logs. When a drain fails beneath an L4 load balancer, it is usually because the distribution of SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT states is broken.

Versions this guide assumes

  • Linux Kernel 5.15+
  • Nginx 1.25+
  • tcpdump 4.99+

1. Correlation between TCP lifecycle and drain

Key takeaways

  • Drain is the process of blocking new SYN and waiting for existing ESTABLISHED connections to terminate naturally (FIN).
  • If the RST ratio increases, the shutdown is closer to a hard close than a graceful one.

Detailed description

Changes expected from normal drain:

  • SYN_RECV: drops quickly (new connections are blocked)
  • ESTABLISHED: decreases slowly
  • FIN_WAIT*, TIME_WAIT: rise temporarily
  • CLOSE_WAIT: stays low

Abnormal pattern:

  • CLOSE_WAIT accumulation: the application closes sockets late.
  • RST spike: the process was force-killed or the network was cut abruptly.
  • TIME_WAIT explosion: too many short-lived connections combined with an unbalanced keepalive policy.

Practical tips

  • Capture ss -ant snapshots automatically at three points: before, during, and after the drain.
  • Treat a CLOSE_WAIT count above the threshold as a condition for halting the deployment.

Common Mistakes

  • Judging any increase in TIME_WAIT as a failure; a temporary rise is expected during drain.
  • Mistaking CLOSE_WAIT for a kernel problem instead of checking the application's close logic.

# Count TCP sockets by state
ss -ant | awk '{print $1}' | sort | uniq -c

# Track states for a specific port (443) during drain
watch -n 2 "ss -ant '( sport = :443 )' | awk '{print \$1}' | sort | uniq -c"
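The CLOSE_WAIT gate from the practical tips above can be sketched as a small shell function. This is a minimal sketch: the threshold value and the gate wiring are assumptions, and note that ss prints the state as CLOSE-WAIT with a hyphen.

```shell
#!/bin/sh
# Hypothetical deploy gate: halt the rollout when CLOSE_WAIT exceeds a threshold.
# close_wait_count echoes the number of CLOSE-WAIT sockets in the given `ss -ant` output.
close_wait_count() {
  printf '%s\n' "$1" | awk '$1 == "CLOSE-WAIT"' | wc -l
}

CLOSE_WAIT_LIMIT=50   # assumed threshold; tune per service

# Usage on the draining host:
#   count=$(close_wait_count "$(ss -ant)")
#   if [ "$count" -gt "$CLOSE_WAIT_LIMIT" ]; then
#     echo "drain gate: CLOSE_WAIT=$count exceeds limit, halting rollout" >&2
#     exit 1
#   fi
```

Keeping the counting logic in a function makes the gate testable against captured snapshots instead of live sockets.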

2. Drain state machine design

Key takeaways

  • Drain must be managed with a state machine to be reproducible.
  • State transitions must be indicator-based; decisions made on timeouts alone are likely to fail.

Detailed description

Recommended state transitions:

  1. BOUND: normal service
  2. DRAIN_REQUESTED: unbind called
  3. DRAINING: new=0, observe the decrease in active connections
  4. GRACEFUL_STOP: start the app's graceful shutdown
  5. TERMINATED: terminate the process, replace the VM
  6. REBOUND: new version binds

The conditions for each state must be defined numerically.

  • DRAINING -> GRACEFUL_STOP: active connections fall below the threshold (e.g. 20)
  • GRACEFUL_STOP -> TERMINATED: FIN exchange completes within the shutdown wait time
  • On timeout: notify and get approval before forcing termination
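The DRAINING -> GRACEFUL_STOP condition above can be expressed as a small, testable predicate. A minimal sketch, assuming the threshold of 20, port 443, and a 2-second polling interval:

```shell
#!/bin/sh
# Indicator-based transition check: move from DRAINING to GRACEFUL_STOP
# when the active (ESTAB) connection count is at or below the threshold.
ACTIVE_THRESHOLD=${ACTIVE_THRESHOLD:-20}

should_stop() {
  # $1: current ESTAB count; exit 0 when the transition is allowed
  [ "$1" -le "$ACTIVE_THRESHOLD" ]
}

# Polling loop on the draining host (port and interval are assumptions):
# while :; do
#   active=$(ss -ant '( sport = :443 )' | awk '$1 == "ESTAB"' | wc -l)
#   should_stop "$active" && break
#   sleep 2
# done
# echo "transition: DRAINING -> GRACEFUL_STOP"
```

Separating the predicate from the polling loop is what makes the transition numeric and reproducible rather than timer-driven.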

Practical tips

  • Record the drain status centrally in the deployment system (e.g. Argo/Jenkins).
  • When a state transition fails, “hold + operator confirmation” is often safer than an automatic rollback.

Common Mistakes

  • The drain status is only written to logs and is never combined with metrics.
  • State transitions carry no fencing token, so duplicate operations occur.
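A fencing token can be as simple as a monotonically increasing number checked before every transition. A file-backed sketch under stated assumptions (a real setup would keep this in the deployment system or a coordination store; the file path and format here are hypothetical):

```shell
#!/bin/sh
# File-backed fencing token: each transition carries a token that must be
# strictly greater than the stored one, so replayed or duplicate operations
# are rejected instead of silently re-running.
STATE_FILE=${STATE_FILE:-/tmp/drain.state}

transition() {
  # $1: fencing token, $2: new state; fails when the token is stale
  current=$(awk 'NR == 1 {print $1}' "$STATE_FILE" 2>/dev/null)
  current=${current:-0}
  [ "$1" -gt "$current" ] || return 1   # stale token: reject the duplicate op
  printf '%s %s\n' "$1" "$2" > "$STATE_FILE"
}

# transition 7 DRAINING        # accepted
# transition 7 GRACEFUL_STOP   # rejected: same token replayed
```

The check is strictly greater-than, so retrying an old request with the same token can never clobber a newer state.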

3. Diagram: TCP state and drain state machine

Key takeaways

  • TCP state flow and operational state flow must be separated and visualized to speed up cause tracking.

Detailed description

Drain State Machine

BOUND
  -> DRAIN_REQUESTED (unbind)
  -> DRAINING (new_conn=0)
  -> GRACEFUL_STOP (active_conn <= threshold)
  -> TERMINATED
  -> REBOUND

Abort path:
DRAINING -> FORCE_CLOSE -> RST spike

Practical tips

  • If you overlay the TCP states and the drain state on the same time axis in one dashboard, the cause of a failure shows up immediately.

Common Mistakes

  • Drawing conclusions from a single point-in-time capture and missing the timing of state transitions.

4. Operational scenario: Deployment with increased RST instead of FIN

Key takeaways

  • Symptoms: client reconnects and reset counts spike at every deployment.
  • Essence: the graceful path is not taken and shutdown falls through to the forced-close path.

Detailed description

Observation log:

[lb] backend=vm-a reset_out=182/s new_conn=0 active_conn=214
[nginx] worker process exited on signal 9
[app] Shutdown hook not completed before SIGKILL

Cause:

  • The systemd stop timeout (30s) was shorter than the app graceful timeout (60s), so SIGKILL arrived mid-shutdown.
  • The Nginx/app shutdown order was reversed.

Resolution:

  1. Set the app graceful timeout to 45s and the systemd stop timeout (TimeoutStopSec) to 90s
  2. Shutdown sequence: app graceful -> nginx quit -> vm terminate
  3. Add warning notification before forced shutdown in case of drain failure
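The corrected sequence — wait for the app to finish before anything stronger — can be sketched as a helper that sends SIGTERM and polls, returning failure so the caller can warn before escalating. The PID variable and the nginx step are assumptions:

```shell
#!/bin/sh
# graceful_stop: send SIGTERM and wait up to a timeout for the process to exit.
# Returns non-zero on timeout so the caller can notify before escalating to SIGKILL.
graceful_stop() {
  pid=$1; timeout=$2
  kill -TERM "$pid" 2>/dev/null || return 0   # already gone: nothing to wait for
  elapsed=0
  while kill -0 "$pid" 2>/dev/null; do
    [ "$elapsed" -ge "$timeout" ] && return 1
    sleep 1
    elapsed=$((elapsed + 1))
  done
}

# Shutdown order from the fix (variable names are assumptions):
# graceful_stop "$APP_PID" 45 || echo "drain failed: warning before forced kill" >&2
# nginx -s quit    # then stop nginx gracefully
# # only after both: terminate the VM
```

Because the helper's timeout (45s here) is shorter than the systemd stop timeout (90s), the escalation warning always fires before SIGKILL can arrive.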

Practical tips

  • Add an RST rate < threshold verification step to your deployment script.
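A minimal version of that verification step, assuming nstat from iproute2 is available to read the kernel's TcpOutRsts counter; the limit value is an assumption:

```shell
#!/bin/sh
# Hypothetical deploy-script step: fail when the number of outgoing RSTs
# emitted during the drain window exceeds a per-drain budget.
rst_delta() {
  # $1: TcpOutRsts reading before, $2: reading after; echoes the increase
  echo $(( $2 - $1 ))
}

RST_LIMIT=100   # assumed per-drain budget; tune per service

read_out_rsts() {
  # -a: absolute counter values, -s: do not update nstat's history file
  nstat -as TcpOutRsts | awk '/TcpOutRsts/ {print $2}'
}

# Usage around the drain window:
#   before=$(read_out_rsts)
#   ... run drain ...
#   after=$(read_out_rsts)
#   [ "$(rst_delta "$before" "$after")" -le "$RST_LIMIT" ] || exit 1
```

Comparing a delta across the drain window, rather than an absolute counter, keeps the check meaningful on long-running hosts.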

Common Mistakes

  • Only increasing the drain time without changing the order of termination signals.

Operational Checklist

  • Are drain state machines and timeout policies documented?
  • Are the TIME_WAIT, CLOSE_WAIT, RST thresholds defined?
  • Is the termination signal order (SIGTERM/SIGQUIT/SIGKILL) specified?
  • Does the deployment automation detect and abort state transition failures?

Summary

Connection draining is a state-transition problem, not a time problem. Graceful shutdown becomes reproducible only when you observe the TCP lifecycle and drive the drain state machine from indicators.

Next episode preview

In the next part, we will cover why drain does not end in the WebSocket Load Balancing environment and five practical strategies for controlling long-lived connections.
