vector search evaluation indicator design

Introduction

Vector search quality is not determined solely by embedding model selection. In practice, index parameters, reranking, and evaluation set quality largely determine the results, but this part is often omitted. This article covers how to continuously improve search quality by linking offline/online metrics.

벡터 검색 평가 지표 설계 커버 — Wikimedia Commons 기반 무료 이미지

Problem definition

Even though search accuracy is low, there are many cases where the cause cannot be identified.

The evaluation set does not reflect actual user queries, so the score and perceived quality are different.
Noise documents increase because only recall is tracked and precision is not looked at.
Since the index parameter change experiment was not recorded, regression is repeated.

Evaluation design requires multi-level indicators rather than single indicators. Recall@k, nDCG, response delay, and cost must be viewed together.

Key concepts

perspective	Design criteria	Verification points
offline accuracy	Recall@k + nDCG	Trend based on test set
Online Quality	Post-search click/completion rate	Real user experience
Performance	query latency + QPS	SLO compliance rate
cost	Token/Embedding Cost	Cost per request

Indicators may conflict with each other. Increasing quality can increase costs, and increasing speed can decrease accuracy, so goal priorities must be clear.

Code example 1: Recall@k calculation

export function recallAtK(retrieved: string[], relevant: Set<string>, k: number) {
  const topK = retrieved.slice(0, k);
  const hit = topK.filter((id) => relevant.has(id)).length;
  return hit / Math.max(relevant.size, 1);
}

export function ndcgAtK(gains: number[], k: number) {
  const top = gains.slice(0, k);
  const dcg = top.reduce((acc, gain, i) => acc + gain / Math.log2(i + 2), 0);
  const ideal = [...gains].sort((a, b) => b - a).slice(0, k);
  const idcg = ideal.reduce((acc, gain, i) => acc + gain / Math.log2(i + 2), 0);
  return idcg === 0 ? 0 : dcg / idcg;
}

Code Example 2: Experimental Registry Record

INSERT INTO vector_search_experiments (
  experiment_id,
  embedding_model,
  index_type,
  top_k,
  recall_at_10,
  ndcg_at_10,
  latency_p95_ms,
  cost_per_1k_queries
) VALUES (
  'exp_20260303_hnsw_m64',
  'text-embedding-3-large',
  'hnsw',
  20,
  0.84,
  0.79,
  145,
  0.63
);

Architecture flow

Mermaid diagram rendering...

Tradeoffs

Increasing top-k increases recall, but delays and costs also increase.
Adding reranking improves quality, but increases pipeline complexity.
Although it costs money to maintain the evaluation set, quality regression can be detected early.

Cleanup

Vector search evaluation is a system optimization problem, not a model comparison. By tracking offline accuracy and online indicators together and managing experiment history, quality can be steadily improved.

Image source

Cover: source link
License: Public domain / Author: Fibonacci.
Note: After downloading the free license image from Wikimedia Commons, it was optimized to JPG at 1600px.