Part 7. Operation checklist and verification method
A practical checklist for verifying Connection Draining and Graceful Shutdown readiness within 10 minutes immediately before deployment in an L4/Nginx/App/WebSocket stack.
Series: The Complete Graceful Drain Guide
7 parts in total. You are reading Part 7.
- Part 1. Why does the service crash if you simply remove a server?
- Part 2. L4 Load Balancer bind/unbind and the connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
This is the last article in the series. The goal is simple: verify the readiness of the L4 load balancer, the Nginx WebSocket proxy, and the application's graceful shutdown within 10 minutes immediately before deployment, and block the possibility of failure with numbers rather than intuition.
Assumed versions
- Linux Kernel 5.15+
- Nginx 1.25+
- JVM 21
- A common observability stack (Prometheus/Grafana or an equivalent solution)
1. Verification of key indicators before deployment
Key takeaways
- Check metrics across at least three layers (LB, Nginx, Linux).
- Even if one layer looks normal, the system as a whole can still fail.
Detailed description
LB metrics:
- active connections
- new connections
- reset count
Nginx metrics:
- worker connections
- keepalive-related metrics (active count / reuse rate)
Linux verification commands:
- ss -s
- ss -ant
- netstat -an
- lsof -i
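For the Nginx layer, the numbers above can be read from the stub_status module. A minimal parsing sketch, assuming stub_status is exposed at a hypothetical `/nginx_status` location on localhost:

```shell
#!/usr/bin/env sh
# Extract the active-connection and waiting (keepalive) counts from
# Nginx stub_status output, e.g.:
#   Active connections: 291
#   server accepts handled requests
#    16630948 16630948 31070465
#   Reading: 6 Writing: 179 Waiting: 106
parse_stub_status() {
  # $1: raw stub_status response body
  printf '%s\n' "$1" | awk '
    /^Active connections:/ { print "active=" $3 }
    /^Reading:/            { print "waiting=" $6 }'
}

# Usage (requires stub_status enabled at the assumed location):
# parse_stub_status "$(curl -s http://127.0.0.1/nginx_status)"
```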
Example verification criteria after starting drain:
- new connections == 0 on the drain target
- active connections decreasing at the expected rate
- No sharp increase in reset count compared to normal operation
- No CLOSE_WAIT accumulation
Practical tips
- Run the same commands three times at 10-minute intervals just before deployment so you see a trend, not a point.
- Separate numerical thresholds by service (HTTP-centered vs. WebSocket-centered).
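The three-snapshot tip above can be scripted so every operator captures the trend the same way. A minimal sketch; the per-state counting is the reusable part, and the snapshot loop is left commented out (the interval is configurable via an environment variable):

```shell
#!/usr/bin/env sh
# Count TCP connections per state from `ss -ant` output.
# Run three snapshots at intervals and compare the counts over time.
count_states() {
  # $1: output of `ss -ant`; the first column (after the header) is the state,
  # printed by ss as ESTAB, CLOSE-WAIT, TIME-WAIT, etc.
  printf '%s\n' "$1" | awk 'NR > 1 { n[$1]++ }
                            END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

INTERVAL="${INTERVAL:-600}"   # seconds between snapshots (600 = 10 minutes)
# for i in 1 2 3; do
#   echo "--- snapshot $i ($(date +%T)) ---"
#   count_states "$(ss -ant)"
#   [ "$i" -lt 3 ] && sleep "$INTERVAL"
# done
```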
Common Mistakes
- Forcing the deployment through based on a single snapshot instead of a trend.
2. Pre-rehearsal: Drain verification scenario
Key takeaways
- Non-disruptive deployment must be verified through “rehearsal” before actual deployment.
- The most realistic way to rehearse is to use a portion of the production traffic.
Detailed description
Rehearsal procedure:
- Select one target VM and perform unbind + drain on the LB.
- Monitor new, active, and reset for 10 minutes.
- Call the app's graceful shutdown API.
- Perform nginx -s quit.
- After shutdown/restart, bind the server again.
- Verify that the error rate and connection metrics recover.
Success conditions:
- No increase in the overall user error rate
- No reset spike
- Predictable deployment completion time
Failure conditions:
- active connections stop decreasing (especially with WebSocket)
- Surge in reconnects
- CLOSE_WAIT accumulation
Practical tips
- Rehearsal failure is cheaper than deployment failure. Feed the cause of any failure back into the runbook immediately.
Common Mistakes
- Rehearsing only in staging, which does not reflect production characteristics (actual session lifetimes).
3. Diagram: Final Verification Pipeline
Key takeaways
- The final verification must be a closed loop of "metric check -> drain -> graceful -> bind".
Detailed description
10-minute validation loop
minute 0-2 : baseline metrics capture
minute 2-5 : drain and observe new/active/reset
minute 5-7 : graceful shutdown + nginx quit
minute 7-10 : bind + recovery verification
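The loop above can be wrapped in a small driver that timestamps and logs each phase, which also produces the per-stage audit trail for retrospectives. A minimal sketch; the phase bodies are placeholders (`true`) to be replaced with the real metric-capture, drain, and shutdown commands:

```shell
#!/usr/bin/env sh
# Skeleton of the 10-minute validation loop with per-phase logging.
LOG="${LOG:-/tmp/drain-verify.$(date +%Y%m%d-%H%M%S).log}"

phase() {
  # $1: phase name; remaining args: the command to run for this phase
  name="$1"; shift
  echo "[$(date +%T)] BEGIN $name" >>"$LOG"
  "$@" >>"$LOG" 2>&1
  echo "[$(date +%T)] END   $name" >>"$LOG"
}

phase "baseline"  true   # minute 0-2 : capture baseline metrics
phase "drain"     true   # minute 2-5 : drain, watch new/active/reset
phase "graceful"  true   # minute 5-7 : graceful shutdown + nginx quit
phase "recovery"  true   # minute 7-10: bind + recovery verification
echo "$LOG"
```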
Practical tips
- Leaving automatically collected logs at each stage of the loop makes retrospectives and improvements easier.
Common Mistakes
- Running the verification loop only manually, so quality varies from operator to operator.
4. 10-minute inspection checklist that operators must check
Key takeaways
- Checklists should be “deployment approval criteria” rather than “to-do lists.”
Detailed description

| Item | How to check | Normal standard | Action when a problem occurs |
| --- | --- | --- | --- |
| LB active connections | LB dashboard | Steady downward trend | Extend drain; apply a session TTL policy |
| LB new connections | LB dashboard | 0 to the drain target | Revalidate unbind status |
| LB reset count | LB dashboard/logs | Within the normal-time range | Check the shutdown sequence; stop any forced shutdown |
| Nginx worker connections | stub_status/metrics | No sudden spikes | Check for upstream overload / reconnection congestion |
| Nginx keepalive status | access logs/metrics | Stable reuse rate | Readjust keepalive_timeout |
| Linux TCP summary | ss -s | No abnormal CLOSE_WAIT accumulation | Fix the app's close path |
| Linux TCP details | ss -ant | ESTABLISHED decreasing, no RST spike | Recheck the drain/shutdown sequence |
| Socket distribution | netstat -an | TIME_WAIT growth rate stable | Tune keepalive/churn |
| Ports per process | lsof -i | No abnormal orphan sockets | Identify the leaking process, then restart |
| App graceful status | Internal health/drain endpoint | Shutdown hook completed | Adjust systemd timeout/order |
Practical tips
- Attach the table as is to the runbook approval stage and leave an automatic check result.
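One way to leave an automatic check result is a gate script that fails the pipeline when a metric crosses its service-specific threshold. A minimal sketch for the CLOSE_WAIT row of the table; the threshold here is a placeholder to be set from your own normal-time data:

```shell
#!/usr/bin/env sh
# Deployment-approval gate: non-zero exit if CLOSE_WAIT exceeds a threshold.
CLOSE_WAIT_MAX="${CLOSE_WAIT_MAX:-10}"   # placeholder; tune per service

close_wait_count() {
  # $1: output of `ss -ant` (ss prints the state as CLOSE-WAIT)
  printf '%s\n' "$1" | awk '$1 == "CLOSE-WAIT" { n++ } END { print n+0 }'
}

check_close_wait() {
  count="$(close_wait_count "$1")"
  if [ "$count" -gt "$CLOSE_WAIT_MAX" ]; then
    echo "FAIL: CLOSE_WAIT=$count exceeds $CLOSE_WAIT_MAX"; return 1
  fi
  echo "OK: CLOSE_WAIT=$count"
}

# check_close_wait "$(ss -ant)"   # wire into the pipeline approval step
```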
Common Mistakes
- Keeping the checklist only as a document and never wiring it into the deployment pipeline.
Summary
Connection Draining is an operating procedure, not an equipment option. By verifying the metrics of each layer (L4 load balancer, Nginx, app, WebSocket) in a 10-minute loop, you can structurally raise the probability of a successful non-disruptive deployment.
Next episode preview
The series ends with this episode. In a follow-up article, we will cover how to map the same principles onto Pod Disruption Budget, preStop, and readiness gates in a Kubernetes environment.
Series navigation
- Previous post: Part 6. Analysis of actual failure cases (SRE perspective)
- Next post: None (last part of this series)