
Kubernetes HPA metrics design

How to design autoscaling based on user experience indicators, moving away from CPU-centric scaling


Introduction

When teams first adopt HPA, they usually start with CPU-based scaling. The problem is that perceived user quality is not always proportional to CPU usage. Bottlenecks that CPU metrics do not capture, such as queue backlog, external API latency, and DB connection saturation, are what cause the real problems.

This article summarizes how to decide, from a service perspective, what to use as the scaling criterion.

Kubernetes HPA metrics design cover (free image based on Wikimedia Commons)

Problem definition

Typical issues that arise when operating only CPU-based HPA:

  • Request volume increases, but scale-out lags because the bottleneck is I/O wait rather than CPU.
  • During a traffic surge the queue backlog explodes first, but CPU looks stable, so the response comes too late.
  • p95 latency does not improve even after scale-out, so the result is only over-provisioning.
  • Costs are unstable because min/max replica values are set only empirically.

The key is to adopt metrics that reflect user impact as the primary scale trigger.

Key concepts

| Metric type | Examples | Suitable service |
| --- | --- | --- |
| Resource metrics | CPU, memory | Simple stateless API |
| Custom metrics | request_per_pod, queue_depth | Background task processing |
| External metrics | Kafka lag, SQS depth | Event-consuming service |
| SLO-based | p95 latency, error budget burn | User-experience-sensitive service |
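As a sketch of the External Metrics row: assuming a metrics adapter already exposes a consumer-lag series to the external metrics API, an HPA for an event-consuming service could look like this (the metric name `kafka_consumer_lag`, the `topic` label, and the target numbers are illustrative assumptions, not fixed names):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: consumer-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: consumer
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: kafka_consumer_lag   # assumed metric exposed by the adapter
          selector:
            matchLabels:
              topic: orders          # illustrative topic label
        target:
          type: AverageValue
          averageValue: "1000"       # target lag per replica
```

With `AverageValue`, the controller divides total lag by the replica count, so replicas grow roughly with backlog per pod.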

In practice, combining a main metric with a protection metric is safer than relying on a single metric. For example: scale out on queue depth, but pause scale-up when the error rate spikes.

Code example 1: HPA(v2) with custom metric

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "40"

Code example 2: metric quality verification queries (PromQL)

# Scale criterion candidate: requests per second per pod
sum(rate(http_server_requests_total{job="api",status!~"5.."}[2m]))
/
sum(kube_pod_status_ready{namespace="prod",condition="true",pod=~"api-.*"})

# Protection metric: p95 latency
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)

Architecture flow

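The rendered diagram is not preserved in this export; the sketch below reconstructs the typical custom-metrics pipeline assumed by the HPA manifest above (component names such as prometheus-adapter describe a common setup, not a mandated one):

```mermaid
flowchart LR
  App[api pods] -->|/metrics| Prom[Prometheus]
  Prom --> Adapter[prometheus-adapter]
  Adapter -->|custom.metrics.k8s.io| HPA[HPA controller]
  HPA -->|scale| Deploy[Deployment: api]
  Deploy --> App
```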

Tradeoffs

  • Custom-metric scaling is more accurate but adds operational complexity.
  • Aggressive scale-up responds quickly but increases cost volatility.
  • Conservative scale-down raises costs but improves stability.

Summary

HPA is a service policy, not just a Kubernetes feature. Instead of looking only at CPU, use user-impact indicators as the scaling criterion to drive real quality improvement. The safest path is to verify metric quality in one service first, then roll it out as the standard.

Image source

  • Cover: source link
  • License: Public domain / Author: unknown
  • Note: The free-license image was downloaded from Wikimedia Commons and optimized to a 1600px JPG.
