#sre

13 posts found.

devops
3 min read
We analyze, from a connection-lifecycle perspective, why hard cuts cause failures in an L4 load balancer and Nginx-WebSocket architecture.
devops
4 min read
We dissect which connection state changes bind, unbind, and connection draining actually trigger in an L4 load balancer, based on the TCP lifecycle.
devops
4 min read
We break Connection Draining down along the TCP lifecycle and explain how SYN, FIN, RST, TIME_WAIT, and CLOSE_WAIT show up in real operational metrics.
devops
4 min read
We analyze why connection draining never finishes when persistent WebSocket connections are present, from the TCP, Nginx proxy, and application perspectives, and present five solutions.
devops
4 min read
We present a seven-step runbook for a Zero Downtime Deployment procedure that can actually be operated in an L4 -> Nginx -> App -> WebSocket architecture.
devops
3 min read
We analyze real failure patterns caused by removing servers without draining, WebSocket drain failures, and mismatched keepalive settings, covering symptoms, logs, causes, and solutions.
devops
4 min read
We present a practical checklist for verifying Connection Draining and Graceful Shutdown readiness within 10 minutes, immediately before deployment, in an L4/Nginx/App/WebSocket architecture.
llm
4 min read
We summarize why retries, timeouts, fallbacks, and circuit breakers in LLM systems should be designed differently from those in regular APIs, along with the corresponding operating patterns.
llm
4 min read
We present an LLM/Agent reference architecture that combines prompting, evaluation, reliability, cost, security, and observability into a single operational framework.
Canary Release Metric Gate Design
2 min read
An operating pattern that automatically evaluates error rate and latency during a gradual rollout to decide whether to promote or abort the release
OpenTelemetry Observability Baseline
2 min read
Instrumentation standards that connect logs, metrics, and traces to shorten the time needed to find the cause of a failure
Incident Response Runbook Design
2 min read
How to create a consistent response flow from alert receipt through communication, recovery, and post-incident analysis
Disaster recovery RTO/RPO definition and practice
2 min read
An operations guide that increases DR reliability by going beyond completing backups to include recovery rehearsals