Kubernetes HPA metrics design
How to design autoscaling based on user experience indicators, moving away from CPU-centric scaling

Introduction
When HPA is first applied, it usually starts with CPU-based scaling. The problem is that users' perceived quality is not always proportional to CPU usage. Bottlenecks that CPU does not capture, such as queue backlog, external API latency, and DB connection pool saturation, cause the real problems.
This article walks through the process of deciding, from a service perspective, which metric should drive scaling.

Problem definition
Typical issues that arise when operating CPU-only HPA:
- Request volume increases, but scale-out lags because the bottleneck is I/O wait rather than CPU.
- When traffic surges, the queue backlog explodes first, but CPU looks stable, so the response is delayed.
- Even after scale-out, p95 latency does not improve, resulting only in over-provisioning.
- Cost is unpredictable because min/max replica values are chosen purely from experience.
The key is to adopt metrics that reflect user impact as the scaling trigger.
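For reference, the CPU-only baseline that these issues describe typically looks like the sketch below (resource names are illustrative):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa-cpu          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    # Scales only on average CPU utilization across pods;
    # blind to queue backlog, I/O wait, and latency.
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```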
Key concepts
| Metric type | Examples | Suitable services |
|---|---|---|
| Resource Metrics | CPU, Memory | Simple stateless API |
| Custom Metrics | request_per_pod, queue_depth | Background task processing |
| External Metrics | Kafka lag, SQS depth | Event consumption service |
| SLO-based | p95 latency, error budget burn | User-sensitive service |
In practice, a combination of "primary metric + guard metric" is safer than a single metric. Example: scale out on queue depth, but halt further scale-up when the error rate spikes.
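One nuance worth noting: the HPA controller evaluates every listed metric and scales to the *highest* desired replica count, so a second metric acts as a backstop rather than a hard stop; pausing scale-up on an error-rate spike has to be enforced outside HPA (e.g., via alerting). A sketch of a primary-plus-backstop spec fragment, assuming a per-pod `queue_depth` custom metric is exposed:

```yaml
metrics:
  # Primary: keep average queue depth per pod at or below 100
  - type: Pods
    pods:
      metric:
        name: queue_depth        # assumed custom metric name
      target:
        type: AverageValue
        averageValue: "100"
  # Backstop: CPU still triggers scale-out if it saturates first;
  # HPA takes the max desired replicas across both metrics
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
```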
Code example 1: HPA (autoscaling/v2) with a custom metric
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react immediately to load
      policies:
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300 # scale down conservatively
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "40"
```
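For `http_requests_per_second` to be resolvable by the HPA controller, something must serve it through the custom metrics API. A minimal prometheus-adapter rule that derives it from a raw counter is sketched below; the series name `http_server_requests_total` and label layout are assumptions about what the application exports:

```yaml
rules:
  - seriesQuery: 'http_server_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^http_server_requests_total$"
      as: "http_requests_per_second"
    # Convert the cumulative counter into a per-pod rate
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
```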
Code example 2: metric quality verification queries (PromQL)
```promql
# Scaling candidate: requests per second per pod
sum(rate(http_server_requests_total{job="api",status!~"5.."}[2m]))
/
count(kube_pod_status_ready{namespace="prod",condition="true",pod=~"api-.*"})

# Guard metric: p95 latency
histogram_quantile(
  0.95,
  sum(rate(http_server_request_duration_seconds_bucket{job="api"}[5m])) by (le)
)
```
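Since HPA cannot block scaling on a guard metric directly, the practical pattern is to alert on it and review scaling behavior when it fires. A sketch of a Prometheus alerting rule for the p95 guard, assuming a 500 ms threshold:

```yaml
groups:
  - name: hpa-guard
    rules:
      - alert: ApiP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_server_request_duration_seconds_bucket{job="api"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "api p95 latency above 500ms; review HPA targets before raising them"
```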
Architecture flow
Application metrics are scraped by Prometheus, exposed to the Kubernetes custom/external metrics API through an adapter such as prometheus-adapter, and read by the HPA controller, which adjusts the Deployment's replica count.
Tradeoffs
- Custom-metric scaling is more accurate but raises operational complexity (adapter, metric pipeline, dashboards).
- Aggressive scale-up responds quickly but increases cost volatility.
- Conservative scale-down improves stability but increases cost.
Conclusion
HPA tuning is a service policy, not just a Kubernetes feature. Instead of watching CPU alone, use user-impact metrics as the scaling standard to drive real quality improvements. The safest path is to validate metric quality on one service first, then expand it into a standard.
Image source
- Cover: source link
- License: Public domain / Author: unknown
- Note: After downloading the free license image from Wikimedia Commons, it was optimized to JPG at 1600px.