
Part 1. Prompt is an interface: Revisiting system boundaries and contracts

From a systems perspective, the essence of an LLM feature is not the prompt text itself but its boundaries, contracts, state, and failure handling.

Series: Why System Engineering Matters More Than Prompt Engineering

The series has 12 parts. You are currently reading part 1.

When you add an LLM feature to a product, the prompt is the first thing that draws your attention. In the demo stage, changing a line or two of the prompt really can change perceived quality dramatically. The problem is that it is easy to believe this experience carries over to production. Most production failures begin with poor boundary design and missing contracts, not with word choice.

Versions used

  • Node.js 20 LTS
  • TypeScript 5.8.x
  • Next.js 16.1.x
  • OpenAI API (Responses API, based on 2026-03 document)
  • PostgreSQL 15
  • Redis 7

Problem statement

These failure patterns appear frequently in the field:

  • A prompt change altered the response format, and the downstream parser failed in a cascade.
  • The model behaved normally, but an external tool call timed out and the request missed its overall SLA.
  • A retry executed the same user request twice, repeating side effects such as payments or reservations.
  • The operations log recorded only "LLM call failed," making it impossible to trace which stage broke and which policy was violated.

The common cause is not a lack of model intelligence. The problem is that the input/output contract, state, and failure policy were never defined at the system level.

Practical example A: Automating customer center responses

Suppose the LLM returns a draft customer-service answer as JSON, and a rules engine then runs a forbidden-word check and tone correction. If the format is enforced only through the prompt, a model update or context change can silently drop a field. The forbidden-word checker then cannot find answer.body, and the entire processing pipeline halts.
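A minimal sketch of this failure mode and the guard that prevents it. The shape of the draft, the field path (`answer.body`), and the fallback text are illustrative assumptions, not taken from a real customer-center system:

```typescript
// Hypothetical draft shape: the model is asked to return { answer: { body } }.
type Draft = { answer?: { body?: string } };

// Illustrative fallback used when the model output is unusable.
const FALLBACK_BODY = "A human agent will follow up on your request shortly.";

function extractBodyOrFallback(raw: string): string {
  let draft: Draft;
  try {
    draft = JSON.parse(raw) as Draft;
  } catch {
    return FALLBACK_BODY; // model did not return valid JSON at all
  }
  // If a model update dropped the field, fail over instead of crashing
  // the forbidden-word checker downstream.
  return draft.answer?.body ?? FALLBACK_BODY;
}
```

The point is not the fallback text but the branch: a missing field becomes a handled case, not a pipeline stop.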

Practical Example B: SQL Analysis Agent for Operators

An agent takes an operator's question, then generates and executes SQL. When the prompt was tuned to respond "more freely," a DELETE query reached the tool-execution phase with no safety guard in between. The essential problem was the missing policy gate, not the prompt.
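A policy gate for this case can be sketched as follows. The check is deliberately naive (prefix-based, no SQL parser); a production gate would combine a real parser with the database's own permission model. The prefix list is an illustrative assumption:

```typescript
// Tool-layer gate: refuse anything that is not a single read-only statement,
// regardless of what the prompt allowed the model to generate.
const READ_ONLY_PREFIXES = ["select", "with", "explain"];

export function isQueryAllowed(sql: string): boolean {
  const trimmed = sql.trim().toLowerCase();
  // No multi-statement batches: "select 1; drop table x" must be rejected.
  const statements = trimmed.split(";").filter((s) => s.trim().length > 0);
  if (statements.length > 1) {
    return false;
  }
  return READ_ONLY_PREFIXES.some((p) => trimmed.startsWith(p));
}
```

Because the gate lives in the tool layer, a "freer" prompt can change the model's wording without ever changing what is executable.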

Key concepts

The key is to treat the LLM not as a "smart function" but as a "stochastic component". The system is then designed around the following principles:

  1. The prompt is the medium that conveys the contract; it is not the contract itself.
  2. Contracts must be verifiable through code and schemas.
  3. Side effects are controlled by state machines and idempotency.
  4. Failure is not an exception but a branch of the normal flow.

| Layer | Responsibility | Frequency of change | Test method |
|---|---|---|---|
| Prompt Layer | Convey intent, control style | High | Offline sample evaluation |
| Contract Layer | Input/output types and constraints | Medium | Schema validation, contract testing |
| Orchestration Layer | State transitions, retries, timeouts | Medium | Simulation, fault-injection testing |
| Policy Layer | Permissions, data access, tool-execution restrictions | Low (strict) | Security testing, regression testing |
| Observability Layer | Tracing, cost, quality-signal collection | Medium | Metric and alert verification |

Architectural Perspective

(Architecture diagram: requests flow through the Contract Validator, Policy Guard, and Model Runtime, with a State Store tracking progress.)

In this structure, the prompt is delivered to the Model Runtime, but the reliability of the system is determined by the Contract Validator, Policy Guard, and State Store.

Practical pattern

Pattern 1: Typed Contract + Strict Parser

Writing "answer in JSON" in the prompt is not enough. The system must treat schema-validation failure as a first-class event.

import { z } from "zod";

// The output contract: validated in code, not merely requested in the prompt.
const AnswerSchema = z.object({
  answer: z.string().min(20),
  confidence: z.number().min(0).max(1),
  citations: z.array(z.string()).max(5),
  safe_to_send: z.boolean(),
});

type Answer = z.infer<typeof AnswerSchema>;

// Throws on both invalid JSON and schema violations; callers must treat
// either failure as a first-class event, not swallow it.
export function parseModelOutput(raw: string): Answer {
  const parsed = JSON.parse(raw);
  return AnswerSchema.parse(parsed);
}

Operating points:

  • Track the parse-failure rate separately from model-quality metrics, with its own alert.
  • Define the selection rule between "re-ask" and "fallback template" for each failure up front.
  • Sample and store raw outputs that fail to parse, after masking sensitive data.
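The re-ask/fallback rule above can be sketched as a generic wrapper. `callModel` and `parse` are assumed to be supplied by the caller (e.g. parseModelOutput from Pattern 1); the single-re-ask budget is an illustrative policy choice:

```typescript
// Tag the result with where it came from, so metrics can distinguish
// clean parses from re-asks and fallbacks.
type ParsedOrFallback<T> = { source: "model" | "reask" | "fallback"; value: T };

export async function parseWithReask<T>(
  callModel: () => Promise<string>,
  parse: (raw: string) => T,
  fallback: T,
): Promise<ParsedOrFallback<T>> {
  try {
    return { source: "model", value: parse(await callModel()) };
  } catch {
    // Exactly one re-ask: unbounded retries of the same prompt tend to
    // amplify systematic failures rather than fix them.
    try {
      return { source: "reask", value: parse(await callModel()) };
    } catch {
      return { source: "fallback", value: fallback };
    }
  }
}
```

The `source` tag is what makes the operating points measurable: fallback rate and re-ask rate become separate time series.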

Pattern 2: Orchestrator state machine + idempotent key

LLM calls are usually combined with external tools. Here, a state-transition design is more stable than a simple chain of function calls.

type Phase = "RECEIVED" | "VALIDATED" | "MODEL_CALLED" | "TOOL_EXECUTED" | "COMPLETED" | "FAILED";

type JobState = {
  requestId: string;
  phase: Phase;
  retries: number;
  idempotencyKey: string;
};

// alreadyCompleted, validateInput, callModelWithTimeout, enforcePolicies,
// executeToolSafely, and markCompleted are assumed helpers.
export async function runJob(state: JobState) {
  // Dedupe first: a retry with the same idempotency key must not
  // repeat side effects.
  if (await alreadyCompleted(state.idempotencyKey)) {
    return { status: "deduped" };
  }

  const validated = await validateInput(state);
  state.phase = "VALIDATED";

  const modelResult = await callModelWithTimeout(validated, 3000); // 3s budget
  state.phase = "MODEL_CALLED";

  const safeResult = await enforcePolicies(modelResult);

  const toolResult = await executeToolSafely(safeResult);
  state.phase = "TOOL_EXECUTED";

  await markCompleted(state.idempotencyKey, toolResult);
  state.phase = "COMPLETED";
  return { status: "ok", toolResult };
}

Operating points:

  • Build the idempotencyKey from a hash of the user input plus a business key.
  • Collect latency and error metrics per phase to locate bottlenecks.
  • Run tool execution on a separate queue so it is isolated from model latency.
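The key-construction rule in the first operating point can be sketched like this. The key layout (`v1:` prefix, sha256, truncated digest) and the normalization step are illustrative choices, not a fixed convention:

```typescript
import { createHash } from "node:crypto";

// A stable key over the business key plus normalized user input: the same
// logical request always maps to the same key, so retries dedupe cleanly.
export function buildIdempotencyKey(businessKey: string, userInput: string): string {
  const normalized = userInput.trim().toLowerCase();
  const digest = createHash("sha256")
    .update(`${businessKey}\n${normalized}`)
    .digest("hex");
  // Keep the business key readable in the key for log inspection.
  return `v1:${businessKey}:${digest.slice(0, 16)}`;
}
```

Whether to normalize the input at all is a design choice: aggressive normalization dedupes more retries, but risks collapsing genuinely distinct requests.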

Failure cases/anti-patterns

Failure Scenario: "Response parsing explosion after prompt hotfix"

Situation:

  • At 20:10 on a Friday, the prompt was hotfixed. Answer quality improved, but the JSON key name changed from citations to references.
  • From 20:14, the parser failure rate jumped from 2% to 37%.
  • The retry logic reused the same prompt, amplifying the failure.

Detection:

  1. The contract_parse_error_rate alert fired.
  2. Traces showed errors concentrated in the MODEL_CALLED -> VALIDATION_FAILED transition.
  3. Mapping the deployment SHA to the prompt version identified the change point.

Mitigation:

  1. Temporarily map references -> citations in a parser compatibility layer.
  2. Disable retries and raise the fallback-response rate.
  3. Immediately route 20% of new traffic to the previous prompt version.
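The compatibility layer in mitigation step 1 can be sketched as a small normalization pass applied before schema validation. Treating the raw output as an untyped record is an assumption about where in the pipeline this runs:

```typescript
// Temporary shim: accept either key name while the prompt and contract
// versions are re-synchronized. Remove once the prompt is rolled back or
// the contract is versioned forward.
type RawOutput = Record<string, unknown>;

export function normalizeCitations(raw: RawOutput): RawOutput {
  if (raw.citations === undefined && Array.isArray(raw.references)) {
    const { references, ...rest } = raw;
    return { ...rest, citations: references };
  }
  return raw;
}
```

Running the shim before the strict parser keeps the contract itself unchanged, which is what makes this a mitigation rather than a silent contract edit.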

Recovery:

  1. Add a prompt/contract version-synchronization rule.
  2. Apply a deployment gate that allows hotfixes only when the contract is unchanged.
  3. Add a "renamed key" scenario to the regression tests.

Representative antipatterns

  • Trying to close out the problem with "one good prompt"
  • Wiring natural-language output directly into tool execution without a contract
  • Papering over failures with retries alone, without classifying the cause
  • Treating prompt changes as lighter than code changes

Checklist

An operational checklist you can apply tomorrow:

  • Are the prompt version and contract version stored and tracked separately?
  • Does model output always have to pass schema verification to proceed to the next step?
  • Are schema failures, policy failures, and tool failures collected as separate indicators?
  • Are the permission context and allow-list enforced when executing tools?
  • Do side-effect operations (payment, reservation, shipping) use idempotent keys?
  • Is the retry policy designed to branch by failure cause, rather than simply repeat the same request?
  • In case of failure, are the prompt rollback path and the system rollback path separated?

Summary

Prompt engineering matters, but it is not the key to operational stability. In production, the stakes are contracts, state, policy, and observability. Treat the prompt as one layer of the system; if the remaining layers are not designed, quality improvements will stay haphazard.

Next episode preview

The next episode, "Quality comes from the evaluation system, not the prompt," explains how offline benchmarks, online metrics, and human feedback loops should be integrated into one system. In particular, it covers why the paradox of "accuracy went up but CS complaints increased" occurs, and how to design evaluation metrics that detect it, with examples.
