Part 5. Non-disruptive deployment strategy in VM + Nginx environment
The Zero Downtime Deployment procedure that can actually be operated in the L4 -> Nginx -> App -> WebSocket architecture is presented in a 7-step runbook.
Series: The Complete Guide to Graceful Drain
7 parts in total. You are currently viewing Part 5.
- Part 1. Why does service crash if you simply remove the server?
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in VM + Nginx environment (current)
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
This article provides a runbook for actually replacing a server in the L4 Load Balancer -> VM -> Nginx -> App -> WebSocket/HTTP structure. The goal is singular: by combining Connection Draining and Graceful Shutdown, make Zero Downtime Deployment reproducible.
Based on versions
- Linux Kernel 5.15+
- Nginx 1.25+
- systemd 252+
- JVM 21
1. Settings that need to be fixed before deployment
Key takeaways
- Settings should guarantee “terminability”, not “performance”.
- Treat WebSocket and keepalive settings separately.
Detailed description
Nginx WebSocket proxy example, default settings:
location /ws {
    proxy_pass http://app_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600;
    proxy_send_timeout 3600;
}
keepalive_timeout 65;
Interpretation of settings:
- proxy_read_timeout 3600: can wait up to 1 hour even if no frames arrive from the server
- proxy_send_timeout 3600: allows for client transmission delay
- keepalive_timeout 65: HTTP keepalive lifetime
If a service has many WebSockets, the timeout should be readjusted to match the drain window rather than being arbitrarily long.
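As a minimal sketch of that readjustment (assuming a drain window of roughly 60 to 120 seconds; the location and upstream names mirror the example above), the WebSocket timeouts could be tightened like this:

```nginx
# Assumption: the drain window is ~120s, so idle WebSocket reads/writes
# should not be allowed to outlive it by hours.
location /ws {
    proxy_pass http://app_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 120;   # aligned with the drain window, not 1 hour
    proxy_send_timeout 120;
}
keepalive_timeout 65;         # HTTP keepalive keeps its own, separate standard
```

The exact values are illustrative; what matters is that the WebSocket timeouts and the drain window come from the same number, not from two unrelated defaults.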
Practical tips
- keepalive_timeout is tuned against API latency and connection reuse rate.
- Separating WebSocket and general API by location or port makes drain control easier.
Common Mistakes
- Setting keepalive and WebSocket timeouts by the same standard.
- Not re-measuring the connection lifetime distribution after changing Nginx settings.
2. Non-disruptive deployment 7-step runbook
Key takeaways
- When the order changes, graceful turns into a hard cut.
- The shutdown order of L4, App, and Nginx must be fixed.
Detailed description
Step 1. Start L4 drain
- Switch the target VM to unbind + draining
Step 2. Block new connections
- Confirm that new connections drops to 0
Step 3. Check the decrease in existing connections
- Observe active connections status with ss -ant
Step 4. Application graceful shutdown
- Reject new requests and induce termination of existing sessions
Step 5. Shut down Nginx
- Workers terminate after cleaning up open connections via nginx -s quit
Step 6. Deploy
- Replace the binary/image, confirm the health check
Step 7. Bind the server
- Bind back to L4, verify gradual traffic inflow
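Step 3 above can be sketched as a small helper that counts the remaining established connections, assuming (for illustration only) that the app listens on port 8080:

```shell
#!/bin/sh
# Count ESTAB connections on a given local port from `ss -ant`-style
# output supplied on stdin (column 4 is the local address:port).
count_established() {
  awk -v port="$1" '$1 == "ESTAB" && $4 ~ (":" port "$") { n++ } END { print n + 0 }'
}

# In production this would poll the live socket table, e.g.:
#   while [ "$(ss -ant | count_established 8080)" -gt 0 ]; do sleep 5; done
```

The polling loop is the numeric "progress condition" for this step: drain is complete only when the count crosses your threshold, not when a timer expires.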
Practical tips
- Set “progress conditions” and “stop conditions” in numbers for each step.
- Specify the maximum time required for each step and establish notification/approval procedures when exceeded.
Common Mistakes
- Executing Step 4 and Step 5 in reverse order, triggering RSTs.
- Allowing mass inflow immediately after Step 7, causing a sharp latency increase without warm-up.
# Example of the recommended shutdown order
curl -X POST http://127.0.0.1:8080/internal/drain/start   # Step 4: app starts draining
sleep 10                                                  # give in-flight sessions time to finish
nginx -s quit                                             # Step 5: Nginx cleans up open connections
systemctl stop my-app.service                             # app fully stops before replacement
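To keep the app's graceful shutdown from being cut short by systemd's default stop timeout, the unit can be given an explicit stop budget. A sketch, assuming the my-app.service name from the script above and a 90-second drain budget:

```ini
# /etc/systemd/system/my-app.service (fragment)
[Service]
# systemd sends SIGTERM on `systemctl stop`; the app should treat it as
# "finish in-flight work, then exit".
KillSignal=SIGTERM
# Give the app time to drain before systemd escalates to SIGKILL.
TimeoutStopSec=90
```

TimeoutStopSec should be derived from the same drain window as the Nginx timeouts; if systemd's budget is shorter, step 4 silently becomes a hard cut.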
3. Diagram: rolling deploy architecture
Key takeaways
- Rolling deployment is not a “one out, one in” operation, but a state transition pipeline.
Detailed description
Rolling set (2 VM example)
T0: VM-A(bound), VM-B(bound)
T1: VM-A(draining), VM-B(bound)
T2: VM-A(deploying), VM-B(bound)
T3: VM-A(bound new ver), VM-B(bound)
T4: VM-B(draining), VM-A(bound new ver)
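The T0-T4 transitions above can be sketched as a one-at-a-time driver. The lb_* helpers are hypothetical placeholders; a real implementation would call your load balancer's API and run the 7-step runbook inside each one:

```shell
#!/bin/sh
# Hypothetical state-transition helpers (stand-ins for real LB API calls).
lb_drain()  { echo "$1: draining"; }
lb_deploy() { echo "$1: deploying"; }
lb_bind()   { echo "$1: bound new ver"; }

rolling_deploy() {
  # Drain, deploy, and rebind one VM at a time to limit the blast radius.
  for vm in "$@"; do
    lb_drain "$vm"
    lb_deploy "$vm"
    lb_bind "$vm"
  done
}
```

The point of the sketch is the invariant: at most one VM is ever outside the "bound" state, which is exactly the one-unit-at-a-time policy from the practical tips.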
Practical tips
- The blast radius can be minimized by following the policy of draining only one unit at a time.
Common Mistakes
- Creating false negatives by not adjusting the health check interval and fail threshold during rolling.
4. Operational Scenario: Brief 5xx surge immediately after redeployment
Key takeaways
- Symptom: 5xx increases rapidly for 30 to 60 seconds immediately after bind
- Cause: app warm-up and L4 bind timing mismatch
Detailed description
Log example:
[lb] bind vm-a at 14:02:10
[app] cache warmup started at 14:02:10
[nginx] upstream prematurely closed connection while reading response header
Resolution:
- Do not pass the LB health check before the warm-up endpoint reports completion.
- Lower the traffic weight for 1~2 minutes after bind.
- Establish a readiness policy that takes the JVM JIT warm-up period into account.
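The first fix can be turned into a bind gate: do not call the L4 bind until the readiness endpoint answers 200. A sketch, where the /internal/ready URL, port 8080, and the 120-second budget are all assumptions for illustration:

```shell
#!/bin/sh
# Poll a readiness URL until it returns HTTP 200 or the deadline passes.
wait_for_ready() {
  url="$1"
  deadline=$(( $(date +%s) + ${2:-120} ))   # default budget: 120s
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)
    [ "$code" = "200" ] && return 0
    sleep 2
  done
  return 1   # warm-up never completed; do NOT bind
}

# Usage: only bind after warm-up completes, e.g.
#   wait_for_ready http://127.0.0.1:8080/internal/ready 120 && l4_bind vm-a
```

Failing closed (return 1, no bind) is the design choice here: a VM that never warms up should page a human rather than receive traffic.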
Practical tips
- Readiness is defined based on dependency ready, not process up.
Common Mistakes
- Binding immediately just because the process has started.
Operational Checklist
- Are the WebSocket/Nginx timeout and drain window aligned?
- Is the 7-step runbook reflected in the automation pipeline?
- Is the shutdown order (app -> nginx -> vm) enforced?
- Is warm-up completion verified before bind?
- Are step-by-step stopping conditions (error rate, reset rate) set?
Summary
Non-disruptive deployment in a VM + Nginx architecture is an ordering, not a feature. L4 drain and App/Nginx graceful shutdown must be driven by the same state machine for Zero Downtime Deployment to be stably repeatable.
Next episode preview
In the next part, we analyze three actual failure cases from an SRE perspective: removal without drain, WebSocket drain failure, and keepalive misconfiguration, each organized into symptoms/logs/causes/solutions.
Series navigation
- Previous post: Part 4. Why drain fails in a WebSocket environment
- Next post: Part 6. Analysis of actual failure cases (SRE perspective)