Part 7. Operation checklist and verification method
A practical checklist for verifying Connection Draining and Graceful Shutdown readiness within 10 minutes immediately before deployment in an L4/Nginx/App/WebSocket stack.
Series: The Complete Graceful Drain Guide
7 parts in total. You are reading Part 7.
- Part 1. Why does the service crash if you simply remove a server?
- Part 2. L4 Load Balancer bind/unbind and the connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
This is the last article in the series. The goal is simple: verify the readiness of the L4 load balancer, the Nginx WebSocket proxy, and the application's graceful shutdown within 10 minutes immediately before deployment, and block the possibility of failure with numbers rather than intuition.
Assumed versions
- Linux Kernel 5.15+
- Nginx 1.25+
- JVM 21
- A common observability stack (Prometheus/Grafana or an equivalent solution)
1. Verification of key indicators before deployment
Key takeaways
- Check metrics across at least three layers (LB, Nginx, Linux).
- Even if one layer looks normal, the system as a whole can still fail.
Detailed description
LB metrics:
- active connections
- new connections
- reset count
Nginx metrics:
- worker connections
- keepalive-related metrics (active count / reuse rate)
Linux verification commands:
- ss -s
- ss -ant
- netstat -an
- lsof -i
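For the Nginx layer, the numbers above can be read from the stub_status module. A minimal parsing sketch, assuming stub_status is exposed at a hypothetical `/nginx_status` location on localhost:

```shell
#!/usr/bin/env sh
# Extract the active-connection and waiting (keepalive) counts from
# Nginx stub_status output, e.g.:
#   Active connections: 291
#   server accepts handled requests
#    16630948 16630948 31070465
#   Reading: 6 Writing: 179 Waiting: 106
parse_stub_status() {
  # $1: raw stub_status response body
  printf '%s\n' "$1" | awk '
    /^Active connections:/ { print "active=" $3 }
    /^Reading:/            { print "waiting=" $6 }'
}

# Usage (requires stub_status enabled at the assumed location):
# parse_stub_status "$(curl -s http://127.0.0.1/nginx_status)"
```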
Example verification criteria after starting drain:
- new connections == 0 on the drain target
- active connections decreasing at the expected rate
- No sharp increase in reset count compared to normal operation
- No CLOSE_WAIT accumulation
Practical tips
- Run the same commands three times at 10-minute intervals just before deployment so you see a trend, not a point.
- Separate numerical thresholds by service (HTTP-centered vs. WebSocket-centered).
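The three-snapshot tip above can be scripted so every operator captures the trend the same way. A minimal sketch; the per-state counting is the reusable part, and the snapshot loop is left commented out (the interval is configurable via an environment variable):

```shell
#!/usr/bin/env sh
# Count TCP connections per state from `ss -ant` output.
# Run three snapshots at intervals and compare the counts over time.
count_states() {
  # $1: output of `ss -ant`; the first column (after the header) is the state,
  # printed by ss as ESTAB, CLOSE-WAIT, TIME-WAIT, etc.
  printf '%s\n' "$1" | awk 'NR > 1 { n[$1]++ }
                            END { for (s in n) printf "%s %d\n", s, n[s] }' | sort
}

INTERVAL="${INTERVAL:-600}"   # seconds between snapshots (600 = 10 minutes)
# for i in 1 2 3; do
#   echo "--- snapshot $i ($(date +%T)) ---"
#   count_states "$(ss -ant)"
#   [ "$i" -lt 3 ] && sleep "$INTERVAL"
# done
```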
Common Mistakes
- Forcing the deployment through based on a single snapshot instead of a trend.
2. Pre-rehearsal: Drain verification scenario
Key takeaways
- Non-disruptive deployment must be verified through “rehearsal” before actual deployment.
- The most realistic way to rehearse is to use a portion of the production traffic.
Detailed description
Rehearsal procedure:
- Select one target VM and perform unbind + drain on the LB.
- Monitor new, active, and reset for 10 minutes.
- Call the app's graceful shutdown API.
- Perform nginx -s quit.
- After shutdown/restart, bind the server again.
- Verify that the error rate and connection metrics recover.
Success conditions:
- No increase in the overall user error rate
- No reset spike
- Predictable deployment completion time
Failure conditions:
- active connections stop decreasing (especially with WebSocket)
- Surge in reconnects
- CLOSE_WAIT accumulation
Practical tips
- Rehearsal failure is cheaper than deployment failure. Feed the cause of any failure back into the runbook immediately.
Common Mistakes
- Rehearsing only in staging, which does not reflect production characteristics (actual session lifetimes).
3. Diagram: Final Verification Pipeline
Key takeaways
- The final verification must be a closed loop of "metric check -> drain -> graceful -> bind".
Detailed description
10-minute validation loop
minute 0-2 : baseline metrics capture
minute 2-5 : drain and observe new/active/reset
minute 5-7 : graceful shutdown + nginx quit
minute 7-10 : bind + recovery verification
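The loop above can be wrapped in a small driver that timestamps and logs each phase, which also produces the per-stage audit trail for retrospectives. A minimal sketch; the phase bodies are placeholders (`true`) to be replaced with the real metric-capture, drain, and shutdown commands:

```shell
#!/usr/bin/env sh
# Skeleton of the 10-minute validation loop with per-phase logging.
LOG="${LOG:-/tmp/drain-verify.$(date +%Y%m%d-%H%M%S).log}"

phase() {
  # $1: phase name; remaining args: the command to run for this phase
  name="$1"; shift
  echo "[$(date +%T)] BEGIN $name" >>"$LOG"
  "$@" >>"$LOG" 2>&1
  echo "[$(date +%T)] END   $name" >>"$LOG"
}

phase "baseline"  true   # minute 0-2 : capture baseline metrics
phase "drain"     true   # minute 2-5 : drain, watch new/active/reset
phase "graceful"  true   # minute 5-7 : graceful shutdown + nginx quit
phase "recovery"  true   # minute 7-10: bind + recovery verification
echo "$LOG"
```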
Practical tips
- Leaving automatically collected logs at each stage of the loop makes retrospectives and improvements easier.
Common Mistakes
- Running the verification loop only manually, so quality varies from operator to operator.
4. 10-minute inspection checklist that operators must check
Key takeaways
- Checklists should be “deployment approval criteria” rather than “to-do lists.”
Detailed description

| Item | How to check | Normal standard | Action when a problem occurs |
| --- | --- | --- | --- |
| LB active connections | LB dashboard | Steady downward trend | Extend drain; apply a session TTL policy |
| LB new connections | LB dashboard | 0 to the drain target | Revalidate unbind status |
| LB reset count | LB dashboard/logs | Within the normal-time range | Check the shutdown sequence; stop any forced shutdown |
| Nginx worker connections | stub_status/metrics | No sudden spikes | Check for upstream overload / reconnection congestion |
| Nginx keepalive status | access logs/metrics | Stable reuse rate | Readjust keepalive_timeout |
| Linux TCP summary | ss -s | No abnormal CLOSE_WAIT accumulation | Fix the app's close path |
| Linux TCP details | ss -ant | ESTABLISHED decreasing, no RST spike | Recheck the drain/shutdown sequence |
| Socket distribution | netstat -an | TIME_WAIT growth rate stable | Tune keepalive/churn |
| Ports per process | lsof -i | No abnormal orphan sockets | Identify the leaking process, then restart |
| App graceful status | Internal health/drain endpoint | Shutdown hook completed | Adjust systemd timeout/order |
Practical tips
- Attach the table as is to the runbook approval stage and leave an automatic check result.
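One way to leave an automatic check result is a gate script that fails the pipeline when a metric crosses its service-specific threshold. A minimal sketch for the CLOSE_WAIT row of the table; the threshold here is a placeholder to be set from your own normal-time data:

```shell
#!/usr/bin/env sh
# Deployment-approval gate: non-zero exit if CLOSE_WAIT exceeds a threshold.
CLOSE_WAIT_MAX="${CLOSE_WAIT_MAX:-10}"   # placeholder; tune per service

close_wait_count() {
  # $1: output of `ss -ant` (ss prints the state as CLOSE-WAIT)
  printf '%s\n' "$1" | awk '$1 == "CLOSE-WAIT" { n++ } END { print n+0 }'
}

check_close_wait() {
  count="$(close_wait_count "$1")"
  if [ "$count" -gt "$CLOSE_WAIT_MAX" ]; then
    echo "FAIL: CLOSE_WAIT=$count exceeds $CLOSE_WAIT_MAX"; return 1
  fi
  echo "OK: CLOSE_WAIT=$count"
}

# check_close_wait "$(ss -ant)"   # wire into the pipeline approval step
```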
Common Mistakes
- Keeping the checklist only as a document and never wiring it into the deployment pipeline.
Summary
Connection Draining is an operating procedure, not an equipment option. By verifying the metrics of each layer (L4 load balancer, Nginx, app, WebSocket) in a 10-minute loop, you can structurally raise the probability of a successful non-disruptive deployment.
Next episode preview
The series ends with this episode. In a follow-up article, we will cover how to map the same principles onto Pod Disruption Budget, preStop, and readiness gates in a Kubernetes environment.
Series navigation
- Previous post: Part 6. Analysis of actual failure cases (SRE perspective)
- Next post: None (last part of this series)