
Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy

LLM quality is more sensitive to the context pipeline than to the model itself. This part covers how to design RAG, memory, freshness, and tenant boundaries from a system perspective.

Series: Why System Engineering Matters More Than Prompt Engineering

A 12-part series. You are reading Part 7.

Many teams attempt to solve LLM quality issues through model selection. However, the reality we often see in operations is different. Even for the same model, the results vary greatly depending on which context, in which order, and which filters are used. In other words, a significant part of quality is determined by “Context Engineering” rather than “Model Engineering”.

Version baseline

  • Node.js 20 LTS
  • TypeScript 5.8.x
  • Next.js 16.1.x
  • OpenAI API (Responses API, as of the 2026-03 docs)
  • PostgreSQL 15
  • Redis 7

Problem statement

In practice, the patterns that appear when context design is weak are as follows:

  • Outdated documents are retrieved ahead of the latest policies, producing incorrect answers.
  • Missing filters in multi-tenant retrieval expose other customers' data.
  • Conversation memory keeps accumulating, driving up token costs and blurring the focus of answers.
  • Simply raising the retrieval top-k injects low-relevance, noisy context.

Practical example A: B2B support portal

The operations team updates policy documents every week, but the index update lagged by a day. The chatbot answered based on the old version of the refund policy, and CS correction work exploded. At its core this was not a model performance issue but a freshness pipeline issue.

Practical example B: Multi-tenant operations assistant

When the tenant filter was applied only after retrieval (post-filter), documents from other tenants were already mixed into the top candidates at the search stage. Some were removed in post-processing, but omissions during prompt assembly led to cross-tenant data leakage.

Key concepts

Context Engineering is not a simple RAG implementation, but a “knowledge supply chain” design. At a minimum, you should be able to answer the following four questions:

  1. What data will be accepted as context? (Source/Authority/Reliability)
  2. What data takes priority? (recency/domain weight/accuracy)
  3. How much to put? (Token Budget/Summary/Compression)
  4. When to retire/renew? (TTL/Reindex/Memory Clear)

| Component | Responsibility | Representative failure | Essential controls |
| --- | --- | --- | --- |
| Retriever | Candidate document lookup | Low relevance | hybrid search, score threshold |
| Ranker | Document prioritization | Ignoring recency | recency boost, policy weight |
| Context Builder | Prompt assembly | Noise injection | dedup, chunk cap, compression |
| Memory Store | Session context retention | Context contamination | TTL, scope (tenant/user/session) |
| Freshness Pipeline | Index updates | Stale-version answers | CDC, reindex SLA |
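The Context Builder row above (dedup, chunk cap, compression) can be sketched as follows. This is a minimal illustration, not the article's implementation: the `Chunk` shape, the 4-characters-per-token estimate, and the option names are assumptions.

```typescript
// Sketch of a Context Builder enforcing dedup, a chunk cap, and a token budget.
type Chunk = { id: string; text: string; score: number };

// Rough token estimate (~4 chars per token); use a real tokenizer in production.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

export function buildContext(
  chunks: Chunk[],
  opts: { maxChunks: number; tokenBudget: number }
): Chunk[] {
  const seen = new Set<string>();
  const picked: Chunk[] = [];
  let used = 0;
  // Consider the highest-scoring chunks first.
  for (const c of [...chunks].sort((a, b) => b.score - a.score)) {
    if (picked.length >= opts.maxChunks) break; // chunk cap
    if (seen.has(c.text)) continue; // exact-duplicate dedup
    const cost = estimateTokens(c.text);
    if (used + cost > opts.tokenBudget) continue; // token budget
    seen.add(c.text);
    picked.push(c);
    used += cost;
  }
  return picked;
}
```

Real systems usually add near-duplicate detection (e.g. shingling) and summarization-based compression on top of this skeleton.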

Practical pattern

Pattern 1: Freshness-Aware Ranking

To get an accurate answer, you shouldn't just look at relevance. Time-sensitive domains such as policy/pricing/operational procedures should combine recency scores.

type Candidate = {
  id: string;
  semanticScore: number; // 0..1
  lexicalScore: number;  // 0..1
  updatedAt: string;
  sourceWeight: number;  // reliability, 0..1
};

function recencyScore(updatedAt: string): number {
  const ageHours = (Date.now() - new Date(updatedAt).getTime()) / 3_600_000;
  if (ageHours <= 24) return 1.0;
  if (ageHours <= 24 * 7) return 0.8;
  if (ageHours <= 24 * 30) return 0.5;
  return 0.2;
}

export function rank(candidates: Candidate[]) {
  const score = (c: Candidate) =>
    0.45 * c.semanticScore +
    0.2 * c.lexicalScore +
    0.25 * recencyScore(c.updatedAt) +
    0.1 * c.sourceWeight;
  return [...candidates].sort((a, b) => score(b) - score(a));
}

Operating points:

  • Different weights are given to recency for each domain (policy/technical document/notice).
  • Include updatedAt in search results and display it along with the basis for the answer.
  • Operate freshness regression (increasing the citation rate of old documents) as a quality alert.
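The first operating point, per-domain recency weighting, can be sketched like this. The domain names and weight values are illustrative assumptions, not the article's production configuration.

```typescript
// Sketch: per-domain recency weights; values here are illustrative.
type Domain = "policy" | "technical" | "notice";

const RECENCY_WEIGHT: Record<Domain, number> = {
  policy: 0.4,     // time-sensitive: recency dominates
  technical: 0.15, // mostly stable reference docs
  notice: 0.3,
};

export function domainScore(
  domain: Domain,
  semanticScore: number,
  recency: number
): number {
  const w = RECENCY_WEIGHT[domain];
  // Remaining weight goes to relevance; a full ranker would also
  // mix in lexical and source-reliability scores as above.
  return (1 - w) * semanticScore + w * recency;
}
```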

Pattern 2: Memory Scope Separation and TTL

More memory is not always better. If you do not separate user/session/task scopes, both incorrect answers and costs increase.

type MemoryRecord = {
  key: string;
  tenantId: string;
  userId: string;
  sessionId: string;
  taskType: "support" | "analysis" | "admin";
  content: string;
  createdAt: string;
  expiresAt: string;
};

export function memoryKey(input: {
  tenantId: string;
  userId: string;
  sessionId: string;
  taskType: string;
}) {
  return `${input.tenantId}:${input.userId}:${input.sessionId}:${input.taskType}`;
}

export function shouldKeepMemory(taskType: string) {
  if (taskType === "support") return { ttlMin: 30 };
  if (taskType === "analysis") return { ttlMin: 180 };
  return { ttlMin: 15 };
}

Operating points:

  • Memory sharing beyond scope starts with a default deny.
  • Adjust the appropriate retention time by considering the TTL expiration rate and regeneration cost.
  • Be sure to pass a safety filter (sensitive information, directive contamination) before memory injection.
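TTL enforcement on top of the record shape above can be sketched as follows (with a trimmed `MemoryRecord` for brevity). A store like Redis enforces TTL natively (`EXPIRE`, or `SET` with `EX`); this sweep is a fallback sketch for stores without native expiry.

```typescript
// Sketch: expiry check and pruning for memory records carrying expiresAt.
type MemoryRecord = { key: string; content: string; expiresAt: string };

export function isExpired(rec: MemoryRecord, now: Date = new Date()): boolean {
  return new Date(rec.expiresAt).getTime() <= now.getTime();
}

// Sweep out expired records; run periodically or on read.
export function pruneExpired(
  records: MemoryRecord[],
  now: Date = new Date()
): MemoryRecord[] {
  return records.filter((r) => !isExpired(r, now));
}
```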

Pattern 3: Double Boundary in Multitenant Retrieval

Tenant boundaries must be applied both before and after retrieval.

{
  "retrieval_policy": {
    "tenant_filter_pre": true,
    "tenant_filter_post": true,
    "max_chunks": 12,
    "min_score": 0.58,
    "cross_tenant_allowed": false,
    "cross_tenant_exception": []
  },
  "memory_policy": {
    "scope": ["tenant", "user", "session"],
    "default_ttl_min": 30,
    "pii_redaction": "strict"
  }
}

# Freshness SLO check: reflection rate of documents updated within the last 24 hours
./ops/freshness/check-index-lag.sh \
  --source policy_docs \
  --target-index primary_vector \
  --slo "p95<15m" \
  --alert-channel platform-oncall

Operating points:

  • Missing pre-filter is classified as a design defect and release is blocked immediately.
  • Include index delay (index_lag) as part of the quality SLO.
  • Fix tenant boundary tests as contract tests rather than integration tests.
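The contract-test idea from the last operating point can be sketched as an assertion that every retrieved chunk carries the requesting tenant's ID. The `RetrievedChunk` shape and error format are illustrative assumptions.

```typescript
// Sketch of a tenant-boundary contract check: every retrieved chunk
// must belong to the requesting tenant; any leak is a hard failure.
type RetrievedChunk = { id: string; tenantId: string; score: number };

export function assertTenantBoundary(
  requestTenantId: string,
  chunks: RetrievedChunk[]
): void {
  const leaked = chunks.filter((c) => c.tenantId !== requestTenantId);
  if (leaked.length > 0) {
    // A single cross-tenant chunk is a design defect, not a soft failure.
    throw new Error(
      `tenant boundary violation: ${leaked.map((c) => c.id).join(", ")}`
    );
  }
}
```

Wired into CI as a contract test against the retrieval service, this fails the build the moment a pre-filter regression lets a foreign tenant's document through.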

Failure cases / anti-patterns

Failure scenario: “Incorrect answers increase rapidly after policy change”

Situation:

  • The refund policy was changed at 10:00 AM, but the index rebuild did not complete until 16:00.
  • In the meantime the chatbot cited the outdated policy, and hundreds of incorrect answers went out.
  • Because nothing threw an error, the system never flagged the problem.

Detection procedure:

  1. stale_citation_ratio surge detection
  2. Analyzing the distribution of retrieval document updatedAt from trace
  3. Check the difference between source-of-truth DB and index version
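The stale_citation_ratio metric from step 1 can be computed directly from trace data. A minimal sketch, assuming each trace records the cited document's `updatedAt` and that "stale" means older than the source-of-truth version:

```typescript
// Sketch: ratio of citations pointing at documents older than the
// source-of-truth version. CitationTrace is an illustrative shape.
type CitationTrace = { docId: string; docUpdatedAt: string };

export function staleCitationRatio(
  traces: CitationTrace[],
  sourceOfTruthUpdatedAt: string
): number {
  if (traces.length === 0) return 0;
  const cutoff = new Date(sourceOfTruthUpdatedAt).getTime();
  const stale = traces.filter(
    (t) => new Date(t.docUpdatedAt).getTime() < cutoff
  ).length;
  return stale / traces.length;
}
```

Alert when this ratio spikes above a domain-specific threshold over a sliding window; in the scenario above it would have fired well before CS tickets did.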

Mitigation Procedures:

  1. Policy domain queries temporarily use the DB direct query path first.
  2. Block replies when citing old documents + switch to human-review
  3. Review of response history in the relevant time period and guidance on user correction

Recovery Procedure:

  1. Introducing near-real-time indexing for policy document domains
  2. Freshness weight and max_age policy enforcement
  3. Add “Confirm index reflection completion” step to the deployment checklist

Representative antipatterns

  • Understanding RAG only at the “vector search and paste” level
  • Keeping retrieval quality and freshness metrics off the operational dashboards
  • Retaining session memory indefinitely, causing context pollution
  • Relying on post-filtering alone to enforce multi-tenant boundaries

Checklist

  • Does the retrieval design include relevance + recency + confidence weights?
  • Is the tenant filter applied to both pre/post stages?
  • Are the memory scope (tenant/user/session/task) and TTL specified?
  • Is index delay (index_lag) operated as a quality SLO?
  • Are there any stale citation detection alerts?
  • Is the chunk number/duplicate/token upper limit enforced in the context builder?
  • Is a human-review bypass route prepared in case of context contamination?

Summary

The key to Context Engineering is not “putting in a lot” but “putting in exactly the right things”. For stable operational quality, RAG, memory, freshness, and multi-tenancy must be designed as one data supply chain rather than optimized in isolation. Even with the same model, a different context pipeline produces completely different results.

Next episode preview

The next part covers agent architecture. We explain how to solve the “works in the demo but breaks in operation” agent problem by combining Planner/Executor separation, state machines, task queues, and tool-call guardrails. In particular, the practical patterns focus on long-running work and recovery from failure.
