Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy
LLM quality is more sensitive to the context path than the model. We summarize how to design RAG, memory, freshness, and tenant boundaries from a system perspective.
Series: Why System Engineering matters more than Prompt Engineering
12 parts in total. You are reading Part 7.
- Part 1. Prompt is an interface: Revisiting system boundaries and contracts
- Part 2. Quality comes from the evaluation loop, not from prompts
- Part 3. Reliability Design: Retry, Timeout, Fallback, Circuit Breaker
- Part 4. Cost Design: Cache, Batching, Routing, Token Budget
- Part 5. Security Design: Prompt Injection, Data Leak, Policy Guard
- Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Part 7. Context Engineering: RAG, Memory, Recency, Multi-Tenancy (current)
- Part 8. Agent architecture: Planner/Executor, state machine, task queue
- Part 9. Productization: Failure UX, Human-in-the-loop, Operational Governance
- Part 10. Change Management: Prompt Changes vs System Changes, Experiments and Rollbacks
- Part 11. Reference Architecture: End-to-End Operational Design
- Part 12. Organization/Process: Operational Maturity Model and Roadmap
Many teams try to solve LLM quality issues through model selection. The reality we see in operations is different: with the same model, results vary greatly depending on which context is supplied, in what order, and through which filters. In other words, a significant part of quality is determined by "Context Engineering" rather than "Model Engineering".
Versions used
- Node.js 20 LTS
- TypeScript 5.8.x
- Next.js 16.1.x
- OpenAI API (Responses API, as of the 2026-03 documentation)
- PostgreSQL 15
- Redis 7
Problem statement
In practice, the patterns that appear when context design is weak are as follows:
- Older documents are searched before the latest policies, resulting in incorrect answers.
- Other customer data is exposed due to missing filters in multi-tenant searches.
- Conversation memories continue to accumulate, increasing token costs and blurring the focus of answers.
- Simply increasing retrieval top-k injects low-relevance, noisy context.
Practical example A: B2B support portal
The operations team updated policy documents every week, but the index update lagged by a day. The chatbot answered based on the old version of the refund policy, and CS correction workload exploded. The essence was not a model performance issue but a freshness pipeline issue.
Practical example B: Multi-tenant operations assistant
Because the tenant filter was applied only after retrieval (post-filter), documents from other tenants were already mixed into the top candidates at the search stage. Some were removed in post-processing, but omissions during prompt assembly led to cross-tenant data mixing.
Key concepts
Context Engineering is not a simple RAG implementation, but a “knowledge supply chain” design. At a minimum, you should be able to answer the following four questions:
- What data will be accepted as context? (Source/Authority/Reliability)
- What data takes priority? (recency/domain weight/accuracy)
- How much to put? (Token Budget/Summary/Compression)
- When to retire/renew? (TTL/Reindex/Memory Clear)

| Component | Responsibility | Representative failure | Essential controls |
| --- | --- | --- | --- |
| Retriever | Fetch candidate documents | Low relevance | hybrid search, score threshold |
| Ranker | Prioritize documents | Ignoring recency | recency boost, policy weight |
| Context Builder | Prompt assembly | Noise injection | dedup, chunk cap, compression |
| Memory Store | Maintain session context | Context contamination | TTL, scope (tenant/user/session) |
| Freshness Pipeline | Index updates | Stale answers | CDC, reindex SLA |
Practical pattern
Pattern 1: Freshness-Aware Ranking
Relevance alone is not enough for an accurate answer. Time-sensitive domains such as policy, pricing, and operational procedures should blend in a recency score.
type Candidate = {
id: string;
semanticScore: number; // 0..1
lexicalScore: number; // 0..1
updatedAt: string;
sourceWeight: number; // source reliability 0..1
};
function recencyScore(updatedAt: string): number {
const ageHours = (Date.now() - new Date(updatedAt).getTime()) / 3_600_000;
if (ageHours <= 24) return 1.0;
if (ageHours <= 24 * 7) return 0.8;
if (ageHours <= 24 * 30) return 0.5;
return 0.2;
}
export function rank(candidates: Candidate[]) {
  const score = (c: Candidate) =>
    0.45 * c.semanticScore +
    0.2 * c.lexicalScore +
    0.25 * recencyScore(c.updatedAt) +
    0.1 * c.sourceWeight;
  return [...candidates].sort((a, b) => score(b) - score(a));
}
Operating points:
- Give recency a different weight per domain (policy / technical docs / notices).
- Include `updatedAt` in search results and surface it alongside the sources cited in the answer.
- Treat freshness regression (a rising citation rate of stale documents) as a quality alert.
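The freshness alert can be sketched as a simple ratio computed over trace data. This is an illustrative sketch: the names `CitedDoc` and `staleCitationRatio` are assumptions, not from any library, and the threshold would come from per-domain policy.

```typescript
// Illustrative sketch: share of cited documents older than a threshold.
// A rising value can drive a freshness-regression alert.
type CitedDoc = { id: string; updatedAt: string };

export function staleCitationRatio(
  cited: CitedDoc[],
  maxAgeHours: number,
  now: number = Date.now(),
): number {
  if (cited.length === 0) return 0;
  const stale = cited.filter(
    (d) => (now - new Date(d.updatedAt).getTime()) / 3_600_000 > maxAgeHours,
  );
  return stale.length / cited.length;
}
```

Computed per domain over a sliding window, this single number makes the "old version answer" failure visible before users report it.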
Pattern 2: Memory Scope Separation and TTL
More memory is not always better. If you do not separate user/session/task scopes, both incorrect answers and costs increase.
type MemoryRecord = {
key: string;
tenantId: string;
userId: string;
sessionId: string;
taskType: "support" | "analysis" | "admin";
content: string;
createdAt: string;
expiresAt: string;
};
export function memoryKey(input: {
tenantId: string;
userId: string;
sessionId: string;
taskType: string;
}) {
return `${input.tenantId}:${input.userId}:${input.sessionId}:${input.taskType}`;
}
export function shouldKeepMemory(taskType: string) {
if (taskType === "support") return { ttlMin: 30 };
if (taskType === "analysis") return { ttlMin: 180 };
return { ttlMin: 15 };
}
Operating points:
- Cross-scope memory sharing is deny-by-default.
- Tune retention by weighing TTL expiration rate against regeneration cost.
- Always pass memory through a safety filter (sensitive data, directive contamination) before injection.
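The pre-injection safety filter can be as simple as a redaction-plus-rejection pass. A minimal sketch, assuming illustrative patterns; the function name `sanitizeMemory` and the regexes are placeholders, and a real guard would cover far more PII classes and injection phrasings.

```typescript
// Sketch of a pre-injection memory filter: redact obvious PII and drop
// records that look like embedded instructions. Patterns are assumptions.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;
const DIRECTIVE = /\b(ignore (all|previous) instructions|system prompt)\b/i;

export function sanitizeMemory(content: string): string | null {
  // Directive contamination: refuse to inject the record at all.
  if (DIRECTIVE.test(content)) return null;
  // Sensitive data: redact rather than drop.
  return content.replace(EMAIL, "[redacted-email]");
}
```

Returning `null` rather than a cleaned string for directive-like content keeps the fail-closed posture: suspicious memory never reaches the prompt.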
Pattern 3: Double Boundary in Multitenant Retrieval
Tenant boundaries must be applied both before and after retrieval.
{
"retrieval_policy": {
"tenant_filter_pre": true,
"tenant_filter_post": true,
"max_chunks": 12,
"min_score": 0.58,
"cross_tenant_allowed": false,
"cross_tenant_exception": []
},
"memory_policy": {
"scope": ["tenant", "user", "session"],
"default_ttl_min": 30,
"pii_redaction": "strict"
}
}
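In code, the double boundary means the tenant filter is part of the retrieval query itself (pre) and every returned chunk is re-checked before prompt assembly (post). A hedged sketch of the post side, with illustrative names (`Chunk`, `enforceTenantBoundary`); the thresholds mirror the policy values above.

```typescript
// Post-filter half of the double boundary: a pre-filtered search should
// never return foreign chunks, so any leak is an incident signal, not
// just something to silently drop.
type Chunk = { id: string; tenantId: string; score: number };

export function enforceTenantBoundary(
  chunks: Chunk[],
  tenantId: string,
  minScore: number,
  maxChunks: number,
): Chunk[] {
  const safe = chunks.filter(
    (c) => c.tenantId === tenantId && c.score >= minScore,
  );
  if (safe.some((c) => false) || safe.length !== chunks.length) {
    // Log loudly: the pre-filter upstream should have prevented this.
    console.error("tenant boundary or score-threshold violation detected");
  }
  return safe.slice(0, maxChunks);
}
```

The key design point is that the post-filter is a verification, not the primary control: if it ever removes a foreign-tenant chunk, the pre-filter is broken and that should page someone.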
# Freshness SLO check: reflection rate of documents updated within the last 24 hours
./ops/freshness/check-index-lag.sh \
--source policy_docs \
--target-index primary_vector \
--slo "p95<15m" \
--alert-channel platform-oncall
Operating points:
- A missing pre-filter is classified as a design defect and blocks release immediately.
- Include index lag (`index_lag`) in the quality SLO.
- Pin tenant-boundary tests as contract tests, not merely integration tests.
Failure cases / anti-patterns
Failure scenario: "Incorrect answers spike after a policy change"
Situation:
- The refund policy was changed at 10:00 AM, but the index rebuild was completed at 16:00.
- Meanwhile, the chatbot cited the outdated policy, producing hundreds of incorrect answers.
- Because nothing failed with an error, the system did not recognize the problem.
Detection procedure:
- Detect a surge in `stale_citation_ratio`
- Analyze the `updatedAt` distribution of retrieved documents from traces
- Check the gap between the source-of-truth DB and the index version
Mitigation procedure:
- Temporarily route policy-domain queries to the direct DB query path.
- Block replies that cite stale documents and switch to human review.
- Review responses from the affected window and send correction guidance to users.
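The first mitigation step, falling back to the source-of-truth DB, can be expressed as a small routing rule. An illustrative sketch; the function name, the domain string, and the 15-minute SLO default are assumptions, not values from the incident.

```typescript
// Routing sketch: when index lag for a time-sensitive domain exceeds its
// SLO, bypass the vector index and query the source-of-truth DB directly.
type Route = "vector_index" | "db_direct";

export function chooseRetrievalRoute(
  domain: string,
  indexLagMinutes: number,
  sloMinutes: number = 15, // assumed default, align with the freshness SLO
): Route {
  if (domain === "policy" && indexLagMinutes > sloMinutes) return "db_direct";
  return "vector_index";
}
```

The direct path is slower and loses semantic ranking, so it is a degradation mode, not a replacement: the point is that a stale-but-fast answer is worse than a slow-but-correct one for policy questions.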
Recovery procedure:
- Introduce near-real-time indexing for the policy document domain
- Enforce a freshness weight and a `max_age` policy
- Add an "index update confirmed" step to the deployment checklist
Representative antipatterns
- Treating RAG as nothing more than "vector search, then paste"
- Leaving retrieval quality and freshness metrics off the operational dashboards
- Retaining session memory indefinitely, causing context pollution
- Relying on a post-filter alone to guarantee multi-tenant boundaries
Checklist
- Does the retrieval design include relevance + recency + confidence weights?
- Is the tenant filter applied to both pre/post stages?
- Are the memory scope (tenant/user/session/task) and TTL specified?
- Is index lag (`index_lag`) operated as a quality SLO?
- Are stale-citation detection alerts in place?
- Is the chunk number/duplicate/token upper limit enforced in the context builder?
- Is a human-review bypass route prepared in case of context contamination?
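The context-builder limits from the checklist (chunk count, duplicates, token budget) can be enforced in one pass. A minimal sketch under stated assumptions: the 4-characters-per-token estimate is a rough heuristic, not a real tokenizer, and exact-string dedup is the simplest possible form.

```typescript
// Sketch of context-builder enforcement: dedupe exact duplicates, cap the
// chunk count, and stop once an approximate token budget is exceeded.
export function buildContext(
  chunks: string[],
  maxChunks: number,
  tokenBudget: number,
): string[] {
  const seen = new Set<string>();
  const out: string[] = [];
  let tokens = 0;
  for (const chunk of chunks) {
    if (seen.has(chunk)) continue; // drop exact duplicates
    const estimate = Math.ceil(chunk.length / 4); // rough token estimate
    if (out.length >= maxChunks || tokens + estimate > tokenBudget) break;
    seen.add(chunk);
    out.push(chunk);
    tokens += estimate;
  }
  return out;
}
```

Because the input is assumed to arrive ranked, truncating from the tail drops the least relevant chunks first, which is the behavior you want when the budget binds.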
Summary
The key to Context Engineering is not "putting in more" but "putting in exactly what is needed". For stable operational quality, RAG, memory, freshness, and multi-tenancy must be designed as one data supply chain rather than optimized in isolation. Even with the same model, a different context pipeline produces completely different results.
Next part preview
The next part covers agent architecture. We explain how to solve the "works in the demo, breaks in operation" agent problem by combining Planner/Executor separation, state machines, task queues, and tool-call guardrails, with practical patterns focused on long-running work and recovery from failure.
Reference links
- OpenAI Developer Docs - Responses API
- OpenTelemetry official documentation
- Martin Fowler - Circuit Breaker
- Blog: LLM Agent Tool Guardrail design
Series navigation
- Previous post: Part 6. Observability Design: Trace, Span, Log Schema, Regression Detection
- Next post: Part 8. Agent Architecture: Planner/Executor, State Machine, Task Queue