Part 2. L4 Load Balancer bind/unbind and connection lifecycle
We dissect which connection-state changes bind, unbind, and connection draining actually cause in an L4 load balancer, grounded in the TCP lifecycle.
Series: The Complete Guide to Graceful Drain
A 7-part series. You are reading Part 2.
- Part 1. Why does the service crash if you simply remove the server?
- Part 2. L4 Load Balancer bind/unbind and connection lifecycle (current)
- Part 3. Internal operation of Connection Draining (TCP perspective)
- Part 4. Why drain fails in a WebSocket environment
- Part 5. Non-disruptive deployment strategy in a VM + Nginx environment
- Part 6. Analysis of actual failure cases (SRE perspective)
- Part 7. Operation checklist and verification method
To properly apply an L4 Load Balancer, Connection Draining, Graceful Shutdown, and Zero Downtime Deployment in practice, you must first understand what bind/unbind changes at the packet level. This article organizes the connection lifecycle around the L4 -> Nginx -> App -> WebSocket structure.
Versions assumed
- Linux Kernel 5.15+
- Nginx 1.25+
- HAProxy 2.8+ or equivalent L4 equipment
- JVM 21
1. What bind/unbind actually changes
Key takeaways
- `bind` registers the server as a backend candidate that can receive new connections.
- `unbind` blocks only new connections; existing connections can either be kept alive or forcibly disconnected.
- What matters in operation is not the unbind itself but the policy applied after it (`drain` vs. hard close).
Detailed description
From an L4 perspective, bind/unbind is a change in the set of routing destinations.
- bind: Add VM to hash/round robin target pool
- unbind: Remove VM from target pool
- drain: Stop routing new TCP 3-way handshakes (SYN) to the VM targeted for removal.
In other words, unbind is a control plane operation, and the actual existence of a failure is determined by how the existing session is handled in the data plane.
- drain on: existing sessions are allowed to finish with a normal FIN-based termination.
- drain off / forced shutdown: the likelihood of RST occurring increases.
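As a concrete example, HAProxy's runtime API can move a server into drain while keeping established connections alive. This is a minimal sketch: the backend name `web_back`, server name `vm-a`, and socket path are hypothetical, and only the command composition is shown as a testable helper.

```shell
# Compose the HAProxy runtime-API command that drains a server
# (blocks new connections while keeping established ones).
hap_drain_cmd() {
  # $1: backend name, $2: server name -- both hypothetical examples
  printf 'set server %s/%s state drain\n' "$1" "$2"
}

# In operation, the command is sent to the admin socket, e.g.:
#   hap_drain_cmd web_back vm-a | socat stdio /var/run/haproxy.sock
```

The same runtime API can later report the server's admin state, which is what the deployment pipeline should poll rather than assuming drain completed.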
Practical tips
- Always fix the maintenance sequence as `unbind -> drain -> terminate`.
- During drain, simultaneously verify that `new connections` is 0 and `active connections` is decreasing.
- Make the drain status check API a mandatory step in the deployment pipeline.
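That pipeline check can be sketched as a small shell helper. The `:443` port and the `ss -ant`-style input format are assumptions; the parsing is separated from the live command so the logic can be dry-run on sample data.

```shell
# Decide whether drain is complete from `ss -ant`-style output:
# drain is done once no ESTABLISHED connection on :443 remains.
drain_complete() {
  # $1: socket listing (state in column 1, local address in column 4)
  local established
  established=$(printf '%s\n' "$1" | awk '$1 == "ESTAB" && $4 ~ /:443$/' | wc -l)
  [ "$established" -eq 0 ]
}

# Live usage (sketch): drain_complete "$(ss -ant)" && echo "safe to terminate"
```

A pipeline would loop on this with a timeout, so a hung keepalive connection cannot block the deployment forever.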
Common Mistakes
- Terminating the VM immediately after unbinding.
- Judging drain as complete just because the health check fails.
- Forgetting that each device names its drain option differently, so the operation runbook must be device-specific.
2. Connection lifecycle: From SYN to FIN
Key takeaways
- Normal termination is based on `FIN`; abnormal termination mostly shows up as `RST` or a timeout.
- At L4, `Connection Draining` ultimately means "block new SYN + wait for existing ESTABLISHED connections to drain out".
Detailed description
Connection flow for a typical request:
- Client -> L4: `SYN`
- L4 -> Nginx (backend VM): forward `SYN`
- After the 3-way handshake completes: `ESTABLISHED`
- Data exchange (HTTP keepalive or WebSocket)
- At the end: `FIN -> ACK -> FIN -> ACK`
The problem comes after unbinding.
- New `SYN`s go to other VMs,
- but existing `ESTABLISHED` connections remain on the old VM.
- If that VM is forcibly shut down, connections end with RST/timeout instead of FIN, and client errors explode.
Practical tips
- Group drain observation metrics into 3 axes: `new`, `active`, and `reset`.
- Don't rely on the L4 log alone; also inspect ESTABLISHED/CLOSE_WAIT on the VM with `ss -ant`.
Common Mistakes
- Watching only the HTTP request success rate and missing the increase in TCP resets.
- Underestimating drain time by assuming keepalive connections behave like short-lived requests.
```shell
# Check per-server connection states
ss -ant | awk 'NR==1 || /:443/'
# Track reset patterns (kernel counters)
netstat -s | egrep -i 'reset|failed|retrans'
```
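To catch a reset spike during drain, comparing before/after snapshots of the kernel counter is usually enough. A minimal sketch, assuming `netstat -s`-style text (the exact counter wording varies by kernel version):

```shell
# Extract the "connection resets received" counter from `netstat -s`-style
# output so a before/after delta can be computed around a drain window.
reset_count() {
  printf '%s\n' "$1" | awk '/connection resets received/ {print $1; exit}'
}

# Sketch of a delta check (threshold 100 is an arbitrary example):
#   before=$(reset_count "$(netstat -s)")
#   ... drain ...
#   after=$(reset_count "$(netstat -s)")
#   [ $((after - before)) -lt 100 ] || echo "reset spike during drain"
```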
3. Architecture diagram: bind/unbind lifecycle
Key takeaways
- The key turning point is the `BOUND -> DRAINING -> UNBOUND` state transition.
- During this window, the application must prepare for Graceful Shutdown.
Detailed description
```
[Before]
Client -> L4 -> VM-A (new + existing)
Client -> L4 -> VM-B (new + existing)

[After unbind VM-A + drain]
Client -> L4 -X-> VM-A (new blocked)
Client -> L4 ---> VM-B (new accepted)
Existing VM-A connections remain until FIN/timeout
```
Practical tips
- Standardizing LB status as `BOUND`, `DRAINING`, `DETACHED` in internal documents reduces communication errors between operators.
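The standardized states can also be enforced mechanically in tooling. A minimal sketch that rejects transitions other than `BOUND -> DRAINING -> DETACHED` (plus re-binding); the state names follow the internal convention suggested above, not any LB API.

```shell
# Allow only the documented LB state transitions; anything else (e.g. jumping
# straight from BOUND to DETACHED without draining) is rejected.
valid_transition() {
  case "$1->$2" in
    'BOUND->DRAINING'|'DRAINING->DETACHED'|'DETACHED->BOUND') return 0 ;;
    *) return 1 ;;
  esac
}
```

A deployment script can call this guard before each step and log the timestamped transition, which also fixes the "abstract drain records" mistake below.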
Common Mistakes
- Recording the drain state only abstractly, without logging transition events (timestamps/metrics).
4. L4 vs L7 Load Balancer Selection Criteria
Key takeaways
- L4 excels at TCP session stability and raw performance; L7 excels at request-level control.
- With many long-lived WebSocket connections, an L4-centric design stays simple, but the drain policy must be stricter.
Detailed description
| Item | L4 Load Balancer | L7 Load Balancer |
|---|---|---|
| Control unit | TCP connection | HTTP request/stream |
| Strengths | Low overhead, high throughput | Routing/header/cookie-based control |
| Weaknesses | Limited request-level policies | Higher proxy overhead |
| WebSockets | Favorable for keeping long-lived sessions | Upgrade handling quality matters |
| Failure symptoms | RST/timeout centered | 5xx/timeout centered |
Practical tips
- `WebSocket Load Balancing` is generally built as L4-only or a mixed L4 + Nginx (L7) structure.
- `Zero Downtime Deployment` is possible only when L4 draining and Nginx/App `Graceful Shutdown` are performed together.
Common Mistakes
- Adding excessive layers and increasing latency even though L7 features are not needed.
- Assuming L4 alone is safe and omitting app-level graceful handling.
Operational Checklist
- Did you switch the target VM to `unbind + drain` before starting maintenance?
- Did you confirm `new connections = 0` during drain?
- Is the `reset count` not spiking compared to normal?
- Did you check the changes in `ESTABLISHED`, `CLOSE_WAIT`, and `TIME_WAIT` on the VM?
- Do the app graceful timeout and the LB drain timeout not conflict?
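The socket-state items in the checklist can be captured with one helper. A sketch that tallies states from `ss -ant`-style output (a header line on the first row is assumed), so a before/after drain comparison is a simple diff:

```shell
# Tally socket states (ESTAB, CLOSE-WAIT, TIME-WAIT, ...) from `ss -ant`
# output, skipping the header row, sorted for stable comparison.
state_summary() {
  printf '%s\n' "$1" | awk 'NR > 1 {count[$1]++} END {for (s in count) print s, count[s]}' | sort
}

# Live usage (sketch): state_summary "$(ss -ant)"
```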
Summary
bind/unbind is not a mere device setting; it is control over connection lifetimes. Real Graceful Shutdown and Zero Downtime Deployment hold only when the shutdown sequence of L4, Nginx, and the app is aligned around Connection Draining.
Next episode preview
In the next part, we deeply analyze the internal operation of Connection Draining from the perspective of TCP state transitions (SYN, FIN, RST, TIME_WAIT, CLOSE_WAIT).
Reference links
- RFC 9293: Transmission Control Protocol
- HAProxy Documentation - Connection Handling
- Nginx WebSocket Proxying
- Blog: Queue Backpressure Patterns
Series navigation
- Previous post: Part 1. Why does the service crash if you simply remove the server?
- Next post: Part 3. Internal operation of Connection Draining (TCP perspective)