
Part 5. Non-disruptive deployment strategy in VM + Nginx environment

A 7-step runbook for a Zero Downtime Deployment procedure that can actually be operated in an L4 -> Nginx -> App -> WebSocket architecture.

Series: The Complete Guide to Graceful Drain

A 7-part series. You are reading Part 5.

This article provides a runbook for actually replacing servers in the L4 Load Balancer -> VM -> Nginx -> App -> WebSocket/HTTP structure. There is only one goal: by combining Connection Draining with Graceful Shutdown, make Zero Downtime Deployment reproducible.

Versions assumed

  • Linux Kernel 5.15+
  • Nginx 1.25+
  • systemd 252+
  • JVM 21

1. Settings that need to be fixed before deployment

Key takeaways

  • Tune settings to guarantee that connections can actually terminate, not to maximize performance.
  • Treat WebSocket and HTTP keepalive settings separately.

Detailed description

Example of default Nginx WebSocket proxy settings:

location /ws {
    proxy_pass http://app_upstream;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 3600;
    proxy_send_timeout 3600;
}

keepalive_timeout 65;

Interpretation of settings:

  • proxy_read_timeout 3600: Nginx waits up to 1 hour between reads from the upstream, so an idle WebSocket stays open even with no server frames
  • proxy_send_timeout 3600: Nginx allows up to 1 hour between successive writes to the upstream
  • keepalive_timeout 65: idle lifetime of client-side HTTP keepalive connections

If a service carries many WebSockets, these timeouts should be aligned with the drain window rather than left arbitrarily long.

Practical tips

  • Tune keepalive_timeout against API latency and the connection reuse rate.
  • Separating WebSocket and general API traffic by location or port makes drain control easier.
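As a pre-deployment guard, the configured timeout can be checked against the planned drain window. A minimal sketch, assuming a 300-second drain window; the config line is inlined here for illustration (in production you would grep the real nginx.conf):

```shell
# Check that proxy_read_timeout does not exceed the planned drain window.
# DRAIN_WINDOW and the inlined config line are assumptions for illustration.
DRAIN_WINDOW=300
CONF_LINE='proxy_read_timeout 120;'

# Extract the numeric value (second field, trailing ";" stripped).
TIMEOUT=$(printf '%s\n' "$CONF_LINE" | awk '{ gsub(";", "", $2); print $2 }')

if [ "$TIMEOUT" -le "$DRAIN_WINDOW" ]; then
  echo "ok: ${TIMEOUT}s fits the ${DRAIN_WINDOW}s drain window"
else
  echo "warn: ${TIMEOUT}s exceeds the ${DRAIN_WINDOW}s drain window"
fi
```

The same pattern applies to proxy_send_timeout; running such a check in CI catches a timeout that silently outlives the drain window.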

Common Mistakes

  • Setting keepalive and WebSocket timeouts by the same standard.
  • Not re-measuring the connection lifetime distribution after changing Nginx settings.

2. Non-disruptive deployment 7-step runbook

Key takeaways

  • When the order changes, a graceful drain turns into a hard cut.
  • The shutdown order of L4, app, and Nginx must be fixed.

Detailed description

Step 1. Start L4 drain

  • Switch the target VM to unbind + draining on the L4

Step 2. Block new connections

  • Verify that the rate of new connections drops to 0

Step 3. Check the decrease in existing connections

  • Observe the count of active connections with ss -ant

Step 4. Application graceful shutdown

  • Reject new requests and induce existing sessions to terminate

Step 5. Shut down Nginx

  • nginx -s quit lets workers finish open connections before exiting

Step 6. Deploy

  • Replace the binary/image and confirm health checks

Step 7. Rebind the server

  • Bind back to the L4 and verify that traffic ramps in gradually
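Step 3's wait can be sketched as a polling loop. A minimal sketch: the connection counter is injected as a function so the real ss-based version (shown in the comment, assuming the app listens on port 8080) can be swapped for a stub when testing the runbook script itself:

```shell
# Poll the established-connection count until it reaches 0 or the deadline
# (in seconds) passes. $1 is a function/command that prints the current count.
wait_for_drain() {
  counter_cmd=$1
  deadline=$2
  elapsed=0
  while [ "$elapsed" -lt "$deadline" ]; do
    n=$("$counter_cmd")
    if [ "$n" -eq 0 ]; then
      echo "drained after ${elapsed}s"
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  echo "drain deadline exceeded"
  return 1
}

# Real usage (assumption: app port 8080):
#   count_conns() { ss -Hant state established '( sport = :8080 )' | wc -l; }
#   wait_for_drain count_conns 300
```

The deadline should equal the drain window from section 1; if the loop exits with "drain deadline exceeded", the runbook's stop condition fires instead of proceeding to Step 4.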

Practical tips

  • Define numeric “proceed conditions” and “stop conditions” for each step.
  • Specify a maximum duration for each step and establish notification/approval procedures for when it is exceeded.
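The "numeric conditions" tip can be made concrete with a small gate helper. A sketch, assuming the observed error rate is fetched from monitoring elsewhere and passed in as a percentage:

```shell
# Decide whether a runbook step may proceed, based on a numeric condition.
# $1: observed error rate (%), $2: threshold (%). Values are compared via awk
# because shell arithmetic is integer-only.
step_gate() {
  if awk -v e="$1" -v t="$2" 'BEGIN { exit !(e <= t) }'; then
    echo "proceed"
  else
    echo "stop"
  fi
}

step_gate 0.4 0.5   # prints "proceed"
step_gate 1.2 0.5   # prints "stop"
```

The same shape works for the reset rate or any other per-step metric; the point is that the decision is a comparison against a pre-agreed number, not an operator's judgment call mid-deploy.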

Common Mistakes

  • Executing Step 4 and Step 5 in reverse order, triggering RSTs.
  • Allowing full traffic immediately after Step 7, causing a latency spike without warm-up.

# Example of the recommended shutdown order (drain the app before stopping Nginx)
curl -X POST http://127.0.0.1:8080/internal/drain/start  # Step 4: app enters drain mode
sleep 10                                                 # allow in-flight work to finish
nginx -s quit                                            # Step 5: workers finish open connections
systemctl stop my-app.service                            # stop the drained app service

3. Diagram: rolling deploy architecture

Key takeaways

  • Rolling deployment is not a “one out, one in” operation, but a state transition pipeline.

Detailed description

Rolling set (2-VM example):

T0: VM-A(bound),          VM-B(bound)
T1: VM-A(draining),       VM-B(bound)
T2: VM-A(deploying),      VM-B(bound)
T3: VM-A(bound, new ver), VM-B(bound)
T4: VM-B(draining),       VM-A(bound, new ver)

[Diagram: L4 LB -> VM-A (draining) / VM-B (bound) -> Nginx -> App]
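The T0-T4 transitions above reduce to a per-VM loop. A sketch with stubbed transitions; the real versions would call the L4 API and the deployment pipeline (the stub bodies are assumptions for illustration):

```shell
# Stub transitions; each prints the state it moves the VM into.
# Replace with real L4 API / deployment calls in production.
drain()  { echo "$1 -> draining"; }
deploy() { echo "$1 -> deploying"; }
bind()   { echo "$1 -> bound (new ver)"; }

# One unit at a time: the next VM starts only after the previous one is bound.
for vm in vm-a vm-b; do
  drain "$vm"
  deploy "$vm"
  bind "$vm"
done
```

Keeping the loop strictly sequential is what enforces the "drain only one unit at a time" policy below.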

Practical tips

  • The blast radius can be minimized by following the policy of draining only one unit at a time.

Common Mistakes

  • Not adjusting the health check interval and failure threshold during rolling, which produces false negatives.

4. Operational Scenario: Brief 5xx surge immediately after redeployment

Key takeaways

  • Symptom: 5xx spikes for 30-60 seconds immediately after bind
  • Cause: mismatch between app warm-up and L4 bind timing

Detailed description

Log example:

[lb] bind vm-a at 14:02:10
[app] cache warmup started at 14:02:10
[nginx] upstream prematurely closed connection while reading response header

Resolution:

  1. Do not pass the LB health check until the warm-up endpoint completes.
  2. Lower the traffic weight for 1-2 minutes after bind.
  3. Establish a readiness policy that accounts for the JVM JIT warm-up period.
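The readiness gating described above can be sketched as a probe loop that must succeed before the L4 bind. The probe is injected as a function; the /internal/ready endpoint in the comment is a hypothetical example:

```shell
# Wait until the readiness probe reports "ready", for up to $2 attempts.
# $1 is a function/command that prints the probe result.
wait_ready() {
  probe_cmd=$1
  tries=$2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if [ "$("$probe_cmd")" = "ready" ]; then
      echo "ready"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "not ready"
  return 1
}

# Real usage (hypothetical endpoint; bind to the L4 only on success):
#   probe() { curl -sf http://127.0.0.1:8080/internal/ready; }
#   wait_ready probe 120 && echo "safe to bind"
```

The endpoint behind the probe should report "ready" only after cache warm-up and dependency checks complete, not merely when the process is up.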

Practical tips

  • Define readiness based on dependencies being ready, not on the process being up.

Common Mistakes

  • Binding immediately just because the process has started.

Operational Checklist

  • Are the WebSocket/Nginx timeouts aligned with the drain window?
  • Is the 7-step runbook reflected in the automation pipeline?
  • Is the shutdown order (app -> nginx -> vm) enforced?
  • Is warm-up completion verified before bind?
  • Are step-by-step stopping conditions (error rate, reset rate) set?

Summary

Non-disruptive deployment in a VM + Nginx architecture is an ordering, not a feature. L4 drain and App/Nginx graceful shutdown must be driven by the same state machine to repeat Zero Downtime Deployment stably.

Next episode preview

In the next part, three real failure cases are analyzed from an SRE perspective: removal without drain, WebSocket drain failure, and keepalive misconfiguration, each organized into symptoms/logs/causes/solutions.
