
Canary Release Metric Gate Design

An operating pattern that automatically evaluates error rate and latency during a gradual rollout to decide whether to promote or halt the release


Introduction

A canary deployment limits risk by exposing a new version to a small share of traffic, but without observation gates it is just a staged rollout. Many teams only adjust the traffic ratio and have no clear criteria for interpreting metrics, so the timing of promotion comes down to guesswork. This article walks step by step through the design of a metric gate that automates canary promotion.

Cover image: free image from Wikimedia Commons

Problem definition

A canary without a gate can, if anything, make a rollout less stable. If any of the following symptoms appear, the design needs work:

  • Error rate and latency are compared only as absolute values, so the impact of traffic changes is not reflected.
  • Business KPIs and system KPIs are tracked separately, so quality judgments are biased.
  • Authority over promotion and suspension decisions is ambiguous, so the response to failures is slow.

The key is to layer KPIs: evaluate system stability, user experience, and business conversion together.
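The first symptom can be avoided by comparing the canary against a baseline measured over the same window, rather than against fixed absolute numbers. The following is a minimal sketch of that comparison; the `WindowStats` shape and the `compareToBaseline` helper are hypothetical names introduced here, and the differences are assumed to be expressed in percentage points and milliseconds.

```typescript
// Hypothetical shape: metrics for one variant over one observation window.
type WindowStats = {
  requests: number;     // total requests in the window
  errors: number;       // failed requests (for example HTTP 5xx) in the window
  latencyP95Ms: number; // p95 latency for the window, in milliseconds
};

// Error rate as a percentage; an empty window yields 0 to avoid dividing by zero.
function errorRatePct(stats: WindowStats): number {
  return stats.requests === 0 ? 0 : (stats.errors / stats.requests) * 100;
}

// Canary minus baseline for the same time window.
// errorRateDiffPct is in percentage points, latencyP95DiffMs in milliseconds.
function compareToBaseline(baseline: WindowStats, canary: WindowStats) {
  return {
    errorRateDiffPct: errorRatePct(canary) - errorRatePct(baseline),
    latencyP95DiffMs: canary.latencyP95Ms - baseline.latencyP95Ms,
  };
}
```

For example, with 10 errors in 1,000 baseline requests and 4 errors in 200 canary requests, the canary has fewer errors in absolute terms, yet its error rate is 2% against 1% for the baseline, a difference of 1 percentage point.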

Key concepts

Perspective      | Design criteria                    | Verification points
System gate      | Error rate, p95 latency            | Rate of change compared to baseline
User gate        | LCP, checkout success              | Variation by segment
Business gate    | Conversion rate, cancellation rate | Statistical significance compared to baseline
Promotion policy | Steps 10 -> 25 -> 50 -> 100        | Observation time for each step

In gate design, organizational consensus matters more than the deployment tool. Automation only makes sense if the team has agreed in advance which metrics trigger an immediate stop when they fail.
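The promotion policy (10 -> 25 -> 50 -> 100) can be written down as a small state transition so that the agreed rules are explicit. This is only a sketch under assumptions: `nextWeight` is a hypothetical helper, and the choice that a rollback returns the canary to 0% while a hold repeats the current step is one possible agreement, not a rule stated by any particular tool.

```typescript
type GateDecision = "PROMOTE" | "HOLD" | "ROLLBACK";

// Canary traffic percentages from the promotion policy.
const STEPS = [10, 25, 50, 100] as const;

// Returns the canary traffic weight to apply after one observation window.
// ROLLBACK sends all traffic back to the baseline (weight 0), HOLD keeps the
// current weight for another window, PROMOTE moves to the next larger step.
function nextWeight(current: number, decision: GateDecision): number {
  if (decision === "ROLLBACK") return 0;
  if (decision === "HOLD") return current;
  const next = STEPS.find((step) => step > current);
  return next ?? 100; // already at the last step: stay fully promoted
}
```

A controller loop would call `nextWeight` once per observation window, passing in the decision produced by the gate for that window.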

Code example 1: Gate decision function

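// All values are canary minus baseline over the same observation window.
type GateInput = {
  errorRateDiffPct: number;  // change in error rate, in percent
  latencyP95DiffMs: number;  // change in p95 latency, in milliseconds
  conversionDiffPct: number; // change in conversion rate, in percent
};

export type GateDecision = "PROMOTE" | "HOLD" | "ROLLBACK";

export function evaluateCanaryGate(input: GateInput): GateDecision {
  // Hard stops are checked first so a latency HOLD cannot mask a ROLLBACK.
  if (input.errorRateDiffPct > 0.5) return "ROLLBACK";
  if (input.conversionDiffPct < -1.0) return "ROLLBACK";
  // A latency regression alone pauses promotion at the current step.
  if (input.latencyP95DiffMs > 120) return "HOLD";
  return "PROMOTE";
}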

Code example 2: Argo Rollouts metric gate

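apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-gate
spec:
  metrics:
    - name: error-rate
      # Share of 5xx responses among all requests, in percent.
      # result[0] reads the single series returned by the query.
      successCondition: result[0] < 1.5
      provider:
        prometheus:
          address: http://prometheus.example.com:9090 # placeholder address
          query: |
            100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
    - name: p95-latency
      # p95 latency in milliseconds.
      successCondition: result[0] < 700
      provider:
        prometheus:
          address: http://prometheus.example.com:9090 # placeholder address
          query: |
            histogram_quantile(0.95, sum(rate(http_request_duration_ms_bucket[5m])) by (le))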

Architecture flow

[Diagram: architecture flow]

Tradeoffs

  • Strict gate thresholds are safer, but they can slow down deployment.
  • Automatic promotion reduces operational burden, but it risks wrong decisions when the metrics themselves are unreliable.
  • Including business metrics makes quality judgments more accurate, but it lengthens the analysis delay.
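The last tradeoff follows from how a business gate checks statistical significance: conversion data needs enough samples before a difference can be trusted. The sketch below uses a pooled two-proportion z-test as one possible way to do this. The names `ConversionSample`, `conversionZScore`, and `businessGate`, the default minimum of 1,000 visitors, and the one-sided 5% cutoff (z < -1.645) are all assumptions chosen for illustration.

```typescript
// Hypothetical input: conversions and visitors for one variant in one window.
type ConversionSample = { conversions: number; visitors: number };

// Pooled two-proportion z-statistic for canary vs. baseline conversion rate.
// Negative values mean the canary converts worse than the baseline.
function conversionZScore(baseline: ConversionSample, canary: ConversionSample): number {
  if (baseline.visitors === 0 || canary.visitors === 0) return 0;
  const pBase = baseline.conversions / baseline.visitors;
  const pCanary = canary.conversions / canary.visitors;
  const pooled =
    (baseline.conversions + canary.conversions) / (baseline.visitors + canary.visitors);
  const stdErr = Math.sqrt(
    pooled * (1 - pooled) * (1 / baseline.visitors + 1 / canary.visitors),
  );
  return stdErr === 0 ? 0 : (pCanary - pBase) / stdErr;
}

// Assumed policy: hold until both variants have minVisitors, then roll back
// only if the conversion drop is significant at about the one-sided 5% level.
function businessGate(
  baseline: ConversionSample,
  canary: ConversionSample,
  minVisitors = 1000,
): "PROMOTE" | "HOLD" | "ROLLBACK" {
  if (baseline.visitors < minVisitors || canary.visitors < minVisitors) return "HOLD";
  return conversionZScore(baseline, canary) < -1.645 ? "ROLLBACK" : "PROMOTE";
}
```

The `HOLD` branch is where the extra analysis delay shows up: at a 10% traffic step, the canary may need a much longer window than the system gate before it has collected enough visitors.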

Summary

The value of a canary lies not in the small exposure itself but in automated verification. Designing tiered metric gates and codifying the promotion and suspension rules keeps deployment quality consistent from release to release.

Image source

  • Cover: source link
  • License: CC BY-SA 3.0 / Author: Abigor
  • Note: The freely licensed image was downloaded from Wikimedia Commons and optimized to a 1600px JPG.
