Part 3. Internal operation of Connection Draining (TCP perspective)
We break Connection Draining down into the TCP lifecycle and explain how SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT show up in real operational indicators.
Series: The Complete Guide to Graceful Drain
7 parts in total. You are viewing Part 3.
- 01. Part 1. Why does the service crash if you simply remove a server?
- 02. Part 2. L4 Load Balancer bind/unbind and the connection lifecycle
- 03. Part 3. Internal operation of Connection Draining (TCP perspective) (current)
- 04. Part 4. Why drain fails in a WebSocket environment
- 05. Part 5. Zero-downtime deployment strategy in a VM + Nginx environment
- 06. Part 6. Analysis of real failure cases (SRE perspective)
- 07. Part 7. Operations checklist and verification methods
To really understand Connection Draining, you need to look at TCP state transitions before the HTTP logs. When drain fails beneath an L4 Load Balancer, it is usually because the distribution of SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT states is off.
Versions this article is based on
- Linux Kernel 5.15+
- Nginx 1.25+
- tcpdump 4.99+
1. Correlation between TCP lifecycle and drain
Key takeaways
- Drain is the process of blocking new SYN packets and waiting for existing ESTABLISHED connections to terminate naturally (FIN).
- If the RST ratio increases, the close is closer to hard than graceful.
Detailed description
Changes expected from normal drain:
- SYN_RECV: decreases quickly
- ESTABLISHED: decreases slowly
- FIN_WAIT*, TIME_WAIT: increase temporarily
- CLOSE_WAIT: stays low
Abnormal pattern:
- CLOSE_WAIT accumulation: the application closes sockets late.
- RST spike: the process was force-killed or the network was forcibly disconnected.
- TIME_WAIT explosion: too many short-lived connections plus an unbalanced keepalive policy.
Practical tips
- Automatically save ss -ant snapshots at three points: before, during, and after drain.
- Make exceeding the CLOSE_WAIT threshold a condition for aborting the deployment.
Common Mistakes
- Judging any increase in TIME_WAIT as a failure; a temporary increase is expected during drain.
- Mistaking CLOSE_WAIT for a kernel problem instead of checking the application's close logic.
# Count TCP sockets by state
ss -ant | awk '{print $1}' | sort | uniq -c
# Track states on a specific port (443) during drain
watch -n 2 "ss -ant '( sport = :443 )' | awk '{print \$1}' | sort | uniq -c"
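Building on the practical tip above, the CLOSE_WAIT abort condition can be sketched as a small gate script. This is a minimal sketch, not the article's tooling; the port (443) and threshold (50) are illustrative values.

```shell
#!/usr/bin/env bash
# Sketch: abort a deployment step when CLOSE_WAIT exceeds a threshold.
# PORT and THRESHOLD defaults are illustrative, not from the article.
set -u

PORT="${PORT:-443}"
THRESHOLD="${THRESHOLD:-50}"

# Count CLOSE-WAIT sockets in `ss -ant` output read from stdin.
# (ss prints the state with a hyphen: CLOSE-WAIT.)
count_close_wait() {
  awk '$1 == "CLOSE-WAIT" { n++ } END { print n + 0 }'
}

close_wait_gate() {
  local n
  n=$(ss -ant "( sport = :${PORT} )" | count_close_wait)
  if [ "$n" -gt "$THRESHOLD" ]; then
    echo "ABORT: CLOSE_WAIT=${n} exceeds threshold ${THRESHOLD}" >&2
    return 1
  fi
  echo "OK: CLOSE_WAIT=${n}"
}

# Demo run only where ss is available.
if command -v ss >/dev/null; then
  close_wait_gate || echo "gate failed (would abort the deployment)"
fi
```

A deployment pipeline would call this gate between unbind and process termination and stop the rollout on a non-zero exit.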
2. Drain state machine design
Key takeaways
- Drain must be managed with a state machine to be reproducible.
- State transitions must be indicator-based; decisions made on timeouts alone have a high probability of failure.
Detailed description
Recommended state transitions:
1. BOUND: Normal service
2. DRAIN_REQUESTED: unbind called
3. DRAINING: new=0, observe decrease in activity
4. GRACEFUL_STOP: Start App graceful shutdown
5. TERMINATED: Terminate process, replace VM
6. REBOUND: new version bind
The conditions for each state must be defined numerically.
- DRAINING -> GRACEFUL_STOP: active connections below the threshold (e.g. 20)
- GRACEFUL_STOP -> TERMINATED: FIN completed within the shutdown wait time
- On timeout, notify and get approval before forced termination
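The numeric conditions above can be sketched as an indicator-driven transition check. This is an assumed implementation, not the article's tooling; the threshold of 20 matches the example above, and the port (443) in the indicator function is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: decide the DRAINING -> GRACEFUL_STOP transition from live
# indicators rather than a fixed timeout. Threshold (20) from the text.
set -u

ACTIVE_THRESHOLD="${ACTIVE_THRESHOLD:-20}"

# next_state STATE ACTIVE_COUNT: print the next drain state.
next_state() {
  local state="$1" active="$2"
  case "$state" in
    DRAINING)
      if [ "$active" -le "$ACTIVE_THRESHOLD" ]; then
        echo "GRACEFUL_STOP"
      else
        echo "DRAINING"   # condition not met: stay and keep observing
      fi
      ;;
    *)
      echo "$state"
      ;;
  esac
}

# Illustrative indicator source: count ESTABLISHED sockets on the port.
active_conn() {
  ss -nt state established "( sport = :443 )" | tail -n +2 | wc -l
}
```

A control loop would call `next_state DRAINING "$(active_conn)"` on each tick and only advance when the indicator condition holds.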
Practical tips
- Record the drain state centrally in the deployment system (e.g. Argo/Jenkins).
- “Hold + operator confirmation” is often safer than automatic rollback when state transition fails.
Common Mistakes
- The drain state is only written to logs and never tied to metrics.
- There is no fencing token on state transitions, so duplicate operations occur.
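A real fencing token comes from a coordination service (e.g. etcd or ZooKeeper); as a minimal single-host stand-in for preventing duplicate drain runs, flock(1) works. This is an assumed sketch, not the article's setup; the lock file path is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: guard a drain step against duplicate concurrent runs with
# flock(1). The lock file path is illustrative.
set -u

LOCKFILE="${LOCKFILE:-/tmp/drain-vm-a.lock}"

run_drain_step() {
  # Non-blocking lock: a second concurrent invocation fails fast
  # instead of racing the first one through the state machine.
  flock -n 9 || { echo "drain already in progress" >&2; return 1; }
  echo "drain step running under lock"
} 9>"$LOCKFILE"

run_drain_step
```

A second `run_drain_step` launched while the first still holds the lock prints "drain already in progress" and returns non-zero, so the duplicate operation never touches the state machine.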
3. Diagram: TCP state and drain state machine
Key takeaways
- TCP state flow and operational state flow must be separated and visualized to speed up cause tracking.
Detailed description
Drain State Machine
BOUND
-> DRAIN_REQUESTED (unbind)
-> DRAINING (new_conn=0)
-> GRACEFUL_STOP (active_conn <= threshold)
-> TERMINATED
-> REBOUND
Abort path:
DRAINING -> FORCE_CLOSE -> RST spike
Practical tips
- If you overlay TCP states and drain states on the same time axis in one dashboard, the cause of a failure shows up immediately.
Common Mistakes
- Drawing conclusions from a single point-in-time capture and missing the timing of state transitions.
4. Operational scenario: Deployment with increased RST instead of FIN
Key takeaways
- Symptoms: at every deployment, clients reconnect in bursts and the reset count spikes.
- Essence: the graceful path is not taken and shutdown falls through to the forced path.
Detailed description
Observation log:
[lb] backend=vm-a reset_out=182/s new_conn=0 active_conn=214
[nginx] worker process exited on signal 9
[app] Shutdown hook not completed before SIGKILL
Cause:
- systemd stop timeout (30s) was shorter than app graceful timeout (60s).
- Nginx/App shutdown order reversed
Resolution:
- Set the app graceful timeout to 45s and the systemd stop timeout to 90s
- Shutdown sequence: app graceful -> nginx quit -> vm terminate
- Add warning notification before forced shutdown in case of drain failure
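A minimal sketch of the corrected timeout relationship, assuming the app runs as a systemd service. The unit name and environment variable are hypothetical; the 45s/90s values come from the resolution above.

```ini
# /etc/systemd/system/app.service (fragment, hypothetical unit name)
[Service]
# systemd sends KillSignal, waits TimeoutStopSec, then sends SIGKILL.
# TimeoutStopSec (90s) must exceed the app's graceful timeout (45s),
# otherwise the shutdown hook is killed mid-flight, as in the log above.
KillSignal=SIGTERM
TimeoutStopSec=90
# Hypothetical variable the app reads for its graceful shutdown window.
Environment=GRACEFUL_TIMEOUT_SECONDS=45
```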
Practical tips
- Add an "RST rate < threshold" verification step to your deployment script.
Common Mistakes
- It only increases the drain time and does not change the order of termination signals.
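The "RST rate < threshold" check from the practical tip above can be sketched by sampling the kernel's cumulative OutRsts counter in /proc/net/snmp. This is an assumed sketch; the interval and threshold values are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: an "RST rate below threshold" gate for a deployment script,
# based on the OutRsts counter in /proc/net/snmp.
set -u

# Cumulative outgoing RST count: find the OutRsts column in the Tcp
# header line, then print that column from the Tcp data line.
out_rsts() {
  awk '/^Tcp:/ {
    if ($2 == "RtoAlgorithm") {
      for (i = 2; i <= NF; i++) if ($i == "OutRsts") col = i
    } else if (col) {
      print $col
    }
  }' /proc/net/snmp
}

# check_rst_rate INTERVAL THRESHOLD: sample OutRsts twice and fail if
# the per-second delta exceeds the threshold.
check_rst_rate() {
  local interval="$1" threshold="$2" before after rate
  before=$(out_rsts)
  sleep "$interval"
  after=$(out_rsts)
  rate=$(( (after - before) / interval ))
  if [ "$rate" -gt "$threshold" ]; then
    echo "FAIL: RST rate ${rate}/s exceeds ${threshold}/s" >&2
    return 1
  fi
  echo "PASS: RST rate ${rate}/s"
}
```

In a deployment script this would run as, say, `check_rst_rate 5 10` right after unbind, aborting the rollout on failure instead of only extending the drain time.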
Operational Checklist
- Are drain state machines and timeout policies documented?
- Are the TIME_WAIT, CLOSE_WAIT, and RST thresholds defined?
- Is the termination signal order (SIGTERM/SIGQUIT/SIGKILL) specified?
- Does the deployment automation detect and abort state transition failures?
Summary
Connection draining is a state transition issue, not a time issue. Graceful Shutdown can be reproduced only by observing the TCP lifecycle and operating the drain state machine based on indicators.
Next episode preview
In the next part, we will cover why drain does not end in the WebSocket Load Balancing environment and five practical strategies for controlling long-lived connections.
Reference link
- RFC 9293: TCP States and Events
- Linux ss man page
- Nginx: Controlling nginx
- Blog: Disaster Recovery RTO/RPO Guide
Series navigation
- Previous post: Part 2. L4 Load Balancer bind/unbind and the connection lifecycle
- Next post: Part 4. Why drain fails in a WebSocket environment