Part 5. Non-disruptive deployment strategy in VM + Nginx environment
The Zero Downtime Deployment procedure that can actually be operated in the L4 -> Nginx -> App -> WebSocket architecture is presented in a 7-step runbook.
Series: The Complete Guide to Graceful Drain
7 parts in total. You are currently viewing Part 5.
- Part 1. Why does service crash if you simply remove the server?
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in VM + Nginx environment (current)
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
This article provides a runbook for actually replacing a server in the L4 Load Balancer -> VM -> Nginx -> App -> WebSocket/HTTP structure. The goal is singular: by combining Connection Draining and Graceful Shutdown, make Zero Downtime Deployment reproducible.
Based on versions
- Linux Kernel 5.15+
- Nginx 1.25+
- systemd 252+
- JVM 21
1. Settings that need to be fixed before deployment
Key takeaways
- Settings should guarantee “terminability”, not “performance”.
- Treat WebSocket and keepalive settings separately.
Detailed description
Nginx WebSocket proxy example, default settings:
location /ws {
    proxy_pass http://app_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600;
    proxy_send_timeout 3600;
}
keepalive_timeout 65;
Interpretation of settings:
- proxy_read_timeout 3600: can wait up to 1 hour even if no frames arrive from the server
- proxy_send_timeout 3600: allows for client transmission delay
- keepalive_timeout 65: HTTP keepalive lifetime
If a service has many WebSockets, the timeout should be readjusted to match the drain window rather than being arbitrarily long.
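As a minimal sketch of that readjustment (assuming a drain window of roughly 60 to 120 seconds; the location and upstream names mirror the example above), the WebSocket timeouts could be tightened like this:

```nginx
# Assumption: the drain window is ~120s, so idle WebSocket reads/writes
# should not be allowed to outlive it by hours.
location /ws {
    proxy_pass http://app_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 120;   # aligned with the drain window, not 1 hour
    proxy_send_timeout 120;
}
keepalive_timeout 65;         # HTTP keepalive keeps its own, separate standard
```

The exact values are illustrative; what matters is that the WebSocket timeouts and the drain window come from the same number, not from two unrelated defaults.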
Practical tips
- keepalive_timeout is tuned against API latency and connection reuse rate.
- Separating WebSocket and general API by location or port makes drain control easier.
Common Mistakes
- Setting keepalive and WebSocket timeouts by the same standard.
- Not re-measuring the connection lifetime distribution after changing Nginx settings.
2. Non-disruptive deployment 7-step runbook
Key takeaways
- When the order changes, graceful turns into a hard cut.
- The shutdown order of L4, App, and Nginx must be fixed.
Detailed description
Step 1. Start L4 drain
- Switch the target VM to unbind + draining
Step 2. Block new connections
- Confirm that new connections drops to 0
Step 3. Check the decrease in existing connections
- Observe active connections status with ss -ant
Step 4. Application graceful shutdown
- Reject new requests and induce termination of existing sessions
Step 5. Shut down Nginx
- Workers terminate after cleaning up open connections via nginx -s quit
Step 6. Deploy
- Replace the binary/image, confirm the health check
Step 7. Bind the server
- Bind back to L4, verify gradual traffic inflow
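Step 3 above can be sketched as a small helper that counts the remaining established connections, assuming (for illustration only) that the app listens on port 8080:

```shell
#!/bin/sh
# Count ESTAB connections on a given local port from `ss -ant`-style
# output supplied on stdin (column 4 is the local address:port).
count_established() {
  awk -v port="$1" '$1 == "ESTAB" && $4 ~ (":" port "$") { n++ } END { print n + 0 }'
}

# In production this would poll the live socket table, e.g.:
#   while [ "$(ss -ant | count_established 8080)" -gt 0 ]; do sleep 5; done
```

The polling loop is the numeric "progress condition" for this step: drain is complete only when the count crosses your threshold, not when a timer expires.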
Practical tips
- Set “progress conditions” and “stop conditions” in numbers for each step.
- Specify the maximum time required for each step and establish notification/approval procedures when exceeded.
Common Mistakes
- Executing Step 4 and Step 5 in reverse order, triggering RSTs.
- Allowing mass inflow immediately after Step 7, causing a sharp latency increase without warm-up.
# Example of the recommended shutdown order
curl -X POST http://127.0.0.1:8080/internal/drain/start   # Step 4: app starts draining
sleep 10                                                  # give in-flight sessions time to finish
nginx -s quit                                             # Step 5: Nginx cleans up open connections
systemctl stop my-app.service                             # app fully stops before replacement
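To keep the app's graceful shutdown from being cut short by systemd's default stop timeout, the unit can be given an explicit stop budget. A sketch, assuming the my-app.service name from the script above and a 90-second drain budget:

```ini
# /etc/systemd/system/my-app.service (fragment)
[Service]
# systemd sends SIGTERM on `systemctl stop`; the app should treat it as
# "finish in-flight work, then exit".
KillSignal=SIGTERM
# Give the app time to drain before systemd escalates to SIGKILL.
TimeoutStopSec=90
```

TimeoutStopSec should be derived from the same drain window as the Nginx timeouts; if systemd's budget is shorter, step 4 silently becomes a hard cut.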
3. Diagram: rolling deploy architecture
Key takeaways
- Rolling deployment is not a “one out, one in” operation, but a state transition pipeline.
Detailed description
Rolling set (2 VM example)
T0: VM-A(bound), VM-B(bound)
T1: VM-A(draining), VM-B(bound)
T2: VM-A(deploying), VM-B(bound)
T3: VM-A(bound new ver), VM-B(bound)
T4: VM-B(draining), VM-A(bound new ver)
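The T0-T4 transitions above can be sketched as a one-at-a-time driver. The lb_* helpers are hypothetical placeholders; a real implementation would call your load balancer's API and run the 7-step runbook inside each one:

```shell
#!/bin/sh
# Hypothetical state-transition helpers (stand-ins for real LB API calls).
lb_drain()  { echo "$1: draining"; }
lb_deploy() { echo "$1: deploying"; }
lb_bind()   { echo "$1: bound new ver"; }

rolling_deploy() {
  # Drain, deploy, and rebind one VM at a time to limit the blast radius.
  for vm in "$@"; do
    lb_drain "$vm"
    lb_deploy "$vm"
    lb_bind "$vm"
  done
}
```

The point of the sketch is the invariant: at most one VM is ever outside the "bound" state, which is exactly the one-unit-at-a-time policy from the practical tips.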
Practical tips
- The blast radius can be minimized by following the policy of draining only one unit at a time.
Common Mistakes
- Creating false negatives by not adjusting the health check interval and fail threshold during rolling.
4. Operational Scenario: Brief 5xx surge immediately after redeployment
Key takeaways
- Symptom: 5xx increases rapidly for 30 to 60 seconds immediately after bind
- Cause: app warm-up and L4 bind timing mismatch
Detailed description
Log example:
[lb] bind vm-a at 14:02:10
[app] cache warmup started at 14:02:10
[nginx] upstream prematurely closed connection while reading response header
Resolution:
- Do not pass the LB health check before the warm-up endpoint reports completion.
- Lower the traffic weight for 1~2 minutes after bind.
- Establish a readiness policy that takes the JVM JIT warm-up period into account.
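The first fix can be turned into a bind gate: do not call the L4 bind until the readiness endpoint answers 200. A sketch, where the /internal/ready URL, port 8080, and the 120-second budget are all assumptions for illustration:

```shell
#!/bin/sh
# Poll a readiness URL until it returns HTTP 200 or the deadline passes.
wait_for_ready() {
  url="$1"
  deadline=$(( $(date +%s) + ${2:-120} ))   # default budget: 120s
  while [ "$(date +%s)" -lt "$deadline" ]; do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url" 2>/dev/null)
    [ "$code" = "200" ] && return 0
    sleep 2
  done
  return 1   # warm-up never completed; do NOT bind
}

# Usage: only bind after warm-up completes, e.g.
#   wait_for_ready http://127.0.0.1:8080/internal/ready 120 && l4_bind vm-a
```

Failing closed (return 1, no bind) is the design choice here: a VM that never warms up should page a human rather than receive traffic.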
Practical tips
- Readiness is defined based on dependency ready, not process up.
Common Mistakes
- Binding immediately just because the process has started.
Operational Checklist
- Are the WebSocket/Nginx timeout and drain window aligned?
- Is the 7-step runbook reflected in the automation pipeline?
- Is the shutdown order (app -> nginx -> vm) enforced?
- Is warm-up completion verified before bind?
- Are step-by-step stopping conditions (error rate, reset rate) set?
Summary
Non-disruptive deployment in a VM + Nginx architecture is an ordering, not a feature. L4 drain and App/Nginx graceful shutdown must be driven by the same state machine for Zero Downtime Deployment to be stably repeatable.
Next episode preview
In the next part, we analyze three actual failure cases from an SRE perspective: removal without drain, WebSocket drain failure, and keepalive misconfiguration, each organized into symptoms/logs/causes/solutions.
Series navigation
- Previous post: Part 4. Why drain fails in a WebSocket environment
- Next post: Part 6. Analysis of actual failure cases (SRE perspective)