
Part 7. Operation checklist and verification method

We present a practical checklist for verifying Connection Draining and Graceful Shutdown readiness in the L4/Nginx/App/WebSocket stack within 10 minutes, immediately before deployment.

Series: Graceful Drain Complete Guide

Part 7 of 7. You are reading the final part.



This is the last article in the series. The goal is simple: verify the readiness of the L4 Load Balancer, the Nginx WebSocket proxy, and the application's Graceful Shutdown within 10 minutes immediately before deployment, and rule out failure with numbers rather than intuition.

Assumed versions

  • Linux Kernel 5.15+
  • Nginx 1.25+
  • JVM 21
  • Common observability stack (Prometheus/Grafana or an equivalent solution)

1. Verification of key indicators before deployment

Key takeaways

  • Check indicators across at least 3 layers (LB, Nginx, Linux).
  • Even if a single layer looks normal, the path as a whole can still fail.

Detailed description

LB indicators:

  • active connections
  • new connections
  • reset count

Nginx metrics:

  • worker connections
  • keepalive-related indicators (active count / reuse rate)
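
The worker and keepalive numbers are typically exposed through the stub_status module; a minimal sketch, assuming an internal-only listener on port 8081 (the port and path are illustrative):

```nginx
# Expose basic connection counters on an internal-only endpoint.
server {
    listen 127.0.0.1:8081;
    location /stub_status {
        stub_status;        # active, reading, writing, waiting counters
        allow 127.0.0.1;    # never expose this publicly
        deny all;
    }
}
```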

Linux verification command:

ss -s        # TCP state summary (ESTAB, CLOSE-WAIT, TIME-WAIT totals)
ss -ant      # per-socket listing with state in the first column
netstat -an  # socket distribution (legacy cross-check)
lsof -i      # open sockets per process

Example verification criteria:

  • After drain starts, new connections == 0
  • active connections decline at the expected rate
  • No sharp increase in reset count compared to normal operation
  • No CLOSE_WAIT accumulation

Practical tips

  • Run the same commands three times at 10-minute intervals just before deployment and look at the trend.
  • Separate numerical thresholds by service (HTTP-centered vs. WebSocket-centered).
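
The three-snapshot trend capture can be scripted; a minimal sketch (the output directory is an assumption, and the interval default is kept short here for illustration):

```shell
#!/bin/sh
# Capture the same TCP summary several times at a fixed interval so the
# trend, not a single snapshot, drives the go/no-go decision.
SNAPSHOTS="${SNAPSHOTS:-3}"
INTERVAL="${INTERVAL:-1}"      # seconds; use 600 (10 min) before a real deployment
OUTDIR="${OUTDIR:-/tmp/drain-baseline}"

mkdir -p "$OUTDIR"
i=1
while [ "$i" -le "$SNAPSHOTS" ]; do
    stamp=$(date -u +%Y%m%dT%H%M%SZ)
    # Record explicitly when ss is unavailable rather than failing silently.
    ss -s > "$OUTDIR/ss-summary.$i.$stamp.txt" 2>&1 \
        || echo "ss unavailable" > "$OUTDIR/ss-summary.$i.$stamp.txt"
    [ "$i" -lt "$SNAPSHOTS" ] && sleep "$INTERVAL"
    i=$((i + 1))
done
echo "captured $SNAPSHOTS snapshots in $OUTDIR"
```

Diffing consecutive files then shows whether CLOSE_WAIT or TIME_WAIT is trending up before the deployment starts.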

Common Mistakes

  • Forcing a deployment based on a single one-off snapshot.

2. Pre-rehearsal: Drain verification scenario

Key takeaways

  • Non-disruptive deployment must be verified through “rehearsal” before actual deployment.
  • The most realistic way to rehearse is to use a portion of the production traffic.

Detailed description

Rehearsal Procedure:

  1. Pick one target VM and unbind + drain it
  2. Monitor new, active, and reset counts for 10 minutes
  3. Call the app's graceful shutdown API
  4. Run nginx -s quit
  5. Bind the VM again after shutdown/restart
  6. Confirm that the error rate and connection indicators recover
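
The steps above can be sketched as a dry-run driver. The `lbctl` CLI, the `/internal/drain` endpoint, and the `app` unit name are placeholders (assumptions), not real tooling; substitute your own:

```shell
#!/bin/sh
# Rehearsal driver sketch. lbctl, /internal/drain, and the `app` unit are
# hypothetical names -- replace them with your LB CLI and app endpoints.
set -u
TARGET="${1:-vm-01}"

# Dry-run by default so the sequence can be reviewed before execution;
# swap `echo "+ $*"` for `"$@"` to actually run each step.
run() { echo "+ $*"; }

run lbctl unbind "$TARGET"                          # 1. take the VM out of rotation
run sleep 600                                       # 2. drain window: watch new/active/reset
run curl -fsS "http://$TARGET:8080/internal/drain"  # 3. app graceful API (assumed path)
run ssh "$TARGET" nginx -s quit                     # 4. let Nginx finish in-flight requests
run ssh "$TARGET" systemctl restart app             # 5. shutdown/restart
run lbctl bind "$TARGET"                            # 6. re-bind, then verify recovery
echo "rehearsal sequence printed for $TARGET"
```

Keeping the driver in dry-run mode by default makes it safe to review in the runbook before anyone points it at production.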

Success Conditions:

  • No increase in the overall user error rate
  • No reset spike
  • Predictable deployment end time

Failure conditions:

  • active connections plateau instead of declining (especially WebSocket)

  • reconnect surge
  • CLOSE_WAIT accumulation

Practical tips

  • Rehearsal failure is cheaper than deployment failure. Immediately reflect the cause of failure in the runbook.

Common Mistakes

  • Rehearsing only in staging, which does not reflect production characteristics (real session lifetimes).

3. Diagram: Final Verification Pipeline

Key takeaways

  • The final verification must be a closed loop of “indicator verification -> drain -> graceful -> bind”.

Detailed description

10-minute validation loop

minute 0-2   : baseline metrics capture
minute 2-5   : drain and observe new/active/reset
minute 5-7   : graceful shutdown + nginx quit
minute 7-10  : bind + recovery verification
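
The loop can be run as a phase driver that timestamps each stage into a log, which also covers the tip below about keeping automatic records. A sketch, with the log path assumed and `true` standing in for the real checks from sections 1–2:

```shell
#!/bin/sh
# Phase runner for the 10-minute loop: timestamps each stage into a log so
# every deployment leaves an auditable trail (log path is an assumption).
LOG="${LOG:-/tmp/drain-loop.log}"
: > "$LOG"                      # start a fresh log per run

phase() {
    name="$1"; shift
    echo "$(date -u +%H:%M:%S) START $name" >> "$LOG"
    "$@"                        # run the phase's command
    echo "$(date -u +%H:%M:%S) END   $name" >> "$LOG"
}

# `true` placeholders stand in for the real checks described above.
phase "baseline" true    # minute 0-2: capture metrics
phase "drain"    true    # minute 2-5: unbind + watch new/active/reset
phase "graceful" true    # minute 5-7: app shutdown + nginx -s quit
phase "verify"   true    # minute 7-10: bind + recovery check
echo "loop complete, log at $LOG"
```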
Pre-check → Drain → Graceful Stop → Deploy → Verify

Practical tips

  • Leaving automatically collected logs at each stage of the loop makes retrospectives and improvements easier.

Common Mistakes

  • Running the loop only manually, so verification quality varies from operator to operator.

4. 10-minute inspection checklist that operators must check

Key takeaways

  • Checklists should be “deployment approval criteria” rather than “to-do lists.”

Detailed description

| Item | How to check | Normal criteria | Action on problem |
| --- | --- | --- | --- |
| LB active connections | LB dashboard | Steady decline | Extend drain, apply session TTL policy |
| LB new connections | LB dashboard | 0 on the drain target server | Revalidate unbind status |
| LB reset count | LB dashboard/logs | Stays within the normal range | Check shutdown sequence, stop forced shutdown |
| Nginx worker connections | stub_status/metrics | No sudden spikes | Check upstream overload / reconnection congestion |
| Nginx keepalive status | access logs/metrics | Stable reuse rate | Readjust keepalive_timeout |
| Linux TCP summary | ss -s | No abnormal CLOSE_WAIT accumulation | Fix the app's close path |
| Linux TCP details | ss -ant | ESTABLISHED declining, no RST spike | Recheck drain/shutdown sequence |
| Socket distribution | netstat -an | TIME_WAIT growth rate stable | Tune keepalive/churn |
| Ports per process | lsof -i | No abnormal orphan sockets | Identify the leaking process, then restart |
| App graceful status | Internal health/drain endpoint | Shutdown hook completed | Adjust systemd timeout/ordering |
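
A sketch of how a few table rows can become automated PASS/FAIL lines for the approval step. The threshold and the wiring are illustrative assumptions, not fixed standards:

```shell
#!/bin/sh
# Turn checklist rows into automated PASS/FAIL lines that can be attached
# to the runbook approval step. Thresholds are illustrative assumptions.
fails=0

check() {
    name="$1"; shift
    if "$@" >/dev/null 2>&1; then
        echo "PASS $name"
    else
        echo "FAIL $name"
        fails=$((fails + 1))
    fi
}

close_wait_ok() {
    # CLOSE_WAIT should not accumulate; 10 is an assumed ceiling.
    n=$(ss -ant 2>/dev/null | grep -c '^CLOSE-WAIT')
    [ "${n:-0}" -le 10 ]
}

check "linux-close-wait" close_wait_ok
# Further rows (new connections == 0, keepalive reuse, graceful endpoint)
# would be wired up the same way against your LB/Nginx/app APIs.
echo "failures=$fails"
```

Emitting one line per row keeps the script's output directly pasteable into the approval record.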

Practical tips

  • Attach the table as is to the runbook approval stage and leave an automatic check result.

Common Mistakes

  • Keeping the checklist as a document only, without wiring it into the deployment pipeline.

Summary

Connection Draining is an operating procedure, not an equipment option. By verifying the indicators of each layer (L4 Load Balancer, Nginx, App, WebSocket) in a 10-minute loop, the probability of a successful zero-downtime deployment can be structurally increased.

Next episode preview

The series ends with this episode. In a follow-up article, we will map the same principles to Pod Disruption Budget, preStop, and readiness gates in a Kubernetes environment.
