2 min read

vector search evaluation indicator design

Standard for managing search quality by interpreting Recall@K, MRR, and NDCG according to the service context

vector search evaluation indicator design thumbnail

Introduction

Vector search quality is not determined solely by embedding model selection. In practice, index parameters, reranking, and evaluation set quality largely determine the results, but this part is often omitted. This article covers how to continuously improve search quality by linking offline/online metrics.

벡터 검색 평가 지표 설계 커버
Wikimedia Commons 기반 무료 이미지

Problem definition

Even though search accuracy is low, there are many cases where the cause cannot be identified.

  • The evaluation set does not reflect actual user queries, so the score and perceived quality are different.
  • Noise documents increase because only recall is tracked and precision is not looked at.
  • Since the index parameter change experiment was not recorded, regression is repeated.

Evaluation design requires multi-level indicators rather than single indicators. Recall@k, nDCG, response delay, and cost must be viewed together.

Key concepts

perspectiveDesign criteriaVerification points
offline accuracyRecall@k + nDCGTrend based on test set
Online QualityPost-search click/completion rateReal user experience
Performancequery latency + QPSSLO compliance rate
costToken/Embedding CostCost per request

Indicators may conflict with each other. Increasing quality can increase costs, and increasing speed can decrease accuracy, so goal priorities must be clear.

Code example 1: Recall@k calculation

export function recallAtK(retrieved: string[], relevant: Set<string>, k: number) {
  const topK = retrieved.slice(0, k);
  const hit = topK.filter((id) => relevant.has(id)).length;
  return hit / Math.max(relevant.size, 1);
}

export function ndcgAtK(gains: number[], k: number) {
  const top = gains.slice(0, k);
  const dcg = top.reduce((acc, gain, i) => acc + gain / Math.log2(i + 2), 0);
  const ideal = [...gains].sort((a, b) => b - a).slice(0, k);
  const idcg = ideal.reduce((acc, gain, i) => acc + gain / Math.log2(i + 2), 0);
  return idcg === 0 ? 0 : dcg / idcg;
}

Code Example 2: Experimental Registry Record

INSERT INTO vector_search_experiments (
  experiment_id,
  embedding_model,
  index_type,
  top_k,
  recall_at_10,
  ndcg_at_10,
  latency_p95_ms,
  cost_per_1k_queries
) VALUES (
  'exp_20260303_hnsw_m64',
  'text-embedding-3-large',
  'hnsw',
  20,
  0.84,
  0.79,
  145,
  0.63
);

Architecture flow

Mermaid diagram rendering...

Tradeoffs

  • Increasing top-k increases recall, but delays and costs also increase.
  • Adding reranking improves quality, but increases pipeline complexity.
  • Although it costs money to maintain the evaluation set, quality regression can be detected early.

Cleanup

Vector search evaluation is a system optimization problem, not a model comparison. By tracking offline accuracy and online indicators together and managing experiment history, quality can be steadily improved.

Image source

  • Cover: source link
  • License: Public domain / Author: Fibonacci.
  • Note: After downloading the free license image from Wikimedia Commons, it was optimized to JPG at 1600px.

Comments