
Kubernetes HPA metrics design

How to design autoscaling based on user experience indicators, moving away from CPU-centric scaling


Introduction

When teams first adopt HPA, they usually start with CPU-based scaling. The problem is that perceived user quality is not always proportional to CPU usage. Bottlenecks that CPU metrics do not capture, such as queue backlog, external API latency, and DB connection saturation, are what cause the real problems.

This article summarizes how to decide, from a service perspective, what to use as the scaling criterion.

Kubernetes HPA metrics design cover (free image based on Wikimedia Commons)

Problem definition

Typical issues that arise when operating only CPU-based HPA:

  • Request volume increases, but scale-out lags because the bottleneck is I/O wait rather than CPU.
  • During a traffic surge the queue backlog explodes first, but CPU looks stable, so the response comes too late.
  • p95 latency does not improve even after scale-out, so the result is only over-provisioning.
  • Costs are unstable because min/max replica values are set only empirically.

The key is to adopt metrics that reflect user impact as the primary scale trigger.

Key concepts

| Metric type | Examples | Suitable service |
| --- | --- | --- |
| Resource metrics | CPU, memory | Simple stateless API |
| Custom metrics | request_per_pod, queue_depth | Background task processing |
| External metrics | Kafka lag, SQS depth | Event-consuming service |
| SLO-based | p95 latency, error budget burn | User-experience-sensitive service |
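As a sketch of the External Metrics row: assuming a metrics adapter already exposes a consumer-lag series to the external metrics API, an HPA for an event-consuming service could look like this (the metric name `kafka_consumer_lag`, the `topic` label, and the target numbers are illustrative assumptions, not fixed names):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: consumer
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_lag   # assumed metric exposed by the adapter
          selector:
            matchLabels:
              topic: orders          # illustrative topic label
        target:
          type: AverageValue
          averageValue: "1000"       # target lag per replica
```

With `AverageValue`, the controller divides total lag by the replica count, so replicas grow roughly with backlog per pod.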

In practice, combining a main metric with a protection metric is safer than relying on a single metric. For example: scale out on queue depth, but pause scale-up when the error rate spikes.

Code example 1: HPA(v2) with custom metric

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "40"

Code example 2: metric quality verification queries (PromQL)

# Scale criterion candidate: requests per second per pod
sum(rate(http_server_requests_total{job="api",status!~"5.."}[2m]))
/
sum(kube_pod_status_ready{namespace="prod",condition="true",pod=~"api-.*"})

# Protection metric: p95 latency
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)

Architecture flow

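The rendered diagram is not preserved in this export; the sketch below reconstructs the typical custom-metrics pipeline assumed by the HPA manifest above (component names such as prometheus-adapter describe a common setup, not a mandated one):

```mermaid
flowchart LR
  App[api pods] -->|/metrics| Prom[Prometheus]
  Prom --> Adapter[prometheus-adapter]
  Adapter -->|custom.metrics.k8s.io| HPA[HPA controller]
  HPA -->|scale| Deploy[Deployment: api]
  Deploy --> App
```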

Tradeoffs

  • Custom-metric scaling is more accurate but adds operational complexity.
  • Aggressive scale-up responds quickly but increases cost volatility.
  • Conservative scale-down raises costs but improves stability.

Summary

HPA is a service policy, not just a Kubernetes feature. Instead of looking only at CPU, use user-impact indicators as the scaling criterion to drive real quality improvement. The safest path is to verify metric quality in one service first, then roll it out as the standard.

Image source

  • Cover: source link
  • License: Public domain / Author: unknown
  • Note: The free-license image was downloaded from Wikimedia Commons and optimized to a 1600px JPG.
