Part 8. Agent architecture: Planner/Executor, state machine, task queue

Operable agents are state-based systems, not chains. This part covers Planner/Executor separation, queues, guardrails, and recovery strategies from a practical perspective.

Series: Why System Engineering, Not Prompt Engineering, Is What Matters

A 12-part series. You are reading Part 8.

Building an agent demo is relatively easy: the user makes a request, the model creates a plan, calls tools, and returns the results. The hard part is operation. As requests get longer, tools get slower, and failures repeat, simple chain structures break down quickly. An operational agent should be designed as a “state-based workflow system” rather than a “prompt chain”.

Reference versions

  • Node.js 20 LTS
  • TypeScript 5.8.x
  • Next.js 16.1.x
  • OpenAI API (Responses API, per the 2026-03 documentation)
  • PostgreSQL 15
  • Redis 7

Problem statement

Typical patterns of agent failure in practice are as follows:

  • The planner replans at every step, so token costs and latency climb rapidly.
  • Tool call failures are handled only by re-invoking the model, producing retry loops that never address the root cause.
  • In long-running tasks, a process restart loses the state and the task is executed redundantly.
  • As concurrent requests increase, agent execution occupies API workers and delays the whole service.

Practical example A: Onboarding automation agent

An agent handling new customer account setup, authorization, and initial data validation saw timeouts rise as external APIs slowed down. Requests executed synchronously in a chain structure held workers for a long time, and eventually even the general API was affected.

Practical example B: Data report generation agent

For each user request, the Planner regenerated the entire plan. Even when the work template was identical, replanning every time made costs unnecessarily high, and some steps ran redundantly, producing inconsistent results.

Key concepts

An operational Agent architecture separates at least four elements:

  1. Planner: Decomposes goals into steps (tasks)
  2. Executor: Executes each step under policy
  3. State Machine: Manages phase transitions and recovery conditions
  4. Queue/Scheduler: Handles asynchronous execution, concurrency control, and retry control

| Component | Responsibility | Impact of failure | Design points |
| --- | --- | --- | --- |
| Planner | Goal -> step plan | Bad plans propagate | Plan templates, plan verification |
| Executor | Call tools, collect results | Step failures, loops | Tool permissions, timeouts, retry budget |
| State Store | Persist progress | Duplicate execution, lost state | Idempotency key, checkpoints |
| Queue | Work distribution, backpressure | System-wide delay | Priority, concurrency cap |
| Policy Guard | Block dangerous execution | Security/quality incidents | Allow-list, approval flow |
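
To make the State Machine element concrete, here is a minimal sketch of an explicit transition table. The state names are the ones that appear elsewhere in this article (PLANNED, EXECUTING, FAILED, COMPLETED, QUEUED_REVIEW, NEEDS_REPLAN); the set of allowed edges and the function names are assumptions for illustration, not a prescribed design.

```typescript
// State names come from the article; which edges are allowed is an assumption.
type JobStatus =
  | "PLANNED"
  | "EXECUTING"
  | "QUEUED_REVIEW"
  | "NEEDS_REPLAN"
  | "FAILED"
  | "COMPLETED";

// Every legal move is listed explicitly; anything not listed is rejected.
const ALLOWED_TRANSITIONS: Record<JobStatus, readonly JobStatus[]> = {
  PLANNED: ["EXECUTING", "QUEUED_REVIEW"],
  EXECUTING: ["EXECUTING", "QUEUED_REVIEW", "NEEDS_REPLAN", "FAILED", "COMPLETED"],
  QUEUED_REVIEW: ["EXECUTING", "FAILED"],
  NEEDS_REPLAN: ["PLANNED", "QUEUED_REVIEW", "FAILED"],
  FAILED: [], // terminal
  COMPLETED: [], // terminal
};

function canTransition(from: JobStatus, to: JobStatus): boolean {
  return ALLOWED_TRANSITIONS[from].includes(to);
}

// Returns the new status, or throws if the move is not in the table.
function transition(from: JobStatus, to: JobStatus): JobStatus {
  if (!canTransition(from, to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data means that adding or removing an edge is a reviewable one-line change, and terminal states cannot be left by accident.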

Practical pattern

Pattern 1: Planner/Executor Separation + Planning Template

If the Planner generates a plan from scratch every time, quality variance and cost fluctuations increase. With templates for each domain, the Planner stays stable because it only has to select a template and bind its parameters.

type PlanStep = {
  id: string;
  tool: "fetchProfile" | "validatePolicy" | "createTicket" | "notifyUser";
  requiresApproval: boolean;
  timeoutMs: number;
};

type Plan = {
  planId: string;
  templateId: "onboarding_v1" | "refund_v2";
  steps: PlanStep[];
};

export function buildPlan(taskType: string): Plan {
  if (taskType === "onboarding") {
    return {
      planId: crypto.randomUUID(),
      templateId: "onboarding_v1",
      steps: [
        { id: "1", tool: "fetchProfile", requiresApproval: false, timeoutMs: 1000 },
        { id: "2", tool: "validatePolicy", requiresApproval: false, timeoutMs: 800 },
        { id: "3", tool: "createTicket", requiresApproval: true, timeoutMs: 1200 },
        { id: "4", tool: "notifyUser", requiresApproval: false, timeoutMs: 600 },
      ],
    };
  }

  throw new Error("unsupported task");
}

Operating points:

  • Treat plan template changes as subject to code review and testing.
  • Allow free-form replanning only as a limited recovery path after a failure.
  • Specify per-step timeouts and approval requirements in the plan.
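
Since templates are reviewed and tested like code, a small verification function can back up the review. The sketch below checks a template's steps before it is used. The PlanStep type is repeated from above so the block stands alone; the timeout ceiling, the list of tools that must require approval, and the function name are hypothetical policy choices.

```typescript
// Repeated from the article so this block is self-contained.
type PlanStep = {
  id: string;
  tool: "fetchProfile" | "validatePolicy" | "createTicket" | "notifyUser";
  requiresApproval: boolean;
  timeoutMs: number;
};

// Assumed policy values, not taken from the article.
const MAX_STEP_TIMEOUT_MS = 5000;
const APPROVAL_REQUIRED_TOOLS: ReadonlySet<PlanStep["tool"]> =
  new Set<PlanStep["tool"]>(["createTicket"]);

// Returns human-readable problems; an empty array means the steps passed.
function validatePlanSteps(steps: PlanStep[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();

  if (steps.length === 0) problems.push("plan has no steps");

  for (const step of steps) {
    if (seenIds.has(step.id)) problems.push(`duplicate step id: ${step.id}`);
    seenIds.add(step.id);

    if (!(step.timeoutMs > 0 && step.timeoutMs <= MAX_STEP_TIMEOUT_MS)) {
      problems.push(`step ${step.id}: timeoutMs out of range`);
    }
    if (APPROVAL_REQUIRED_TOOLS.has(step.tool) && !step.requiresApproval) {
      problems.push(`step ${step.id}: ${step.tool} must require approval`);
    }
  }
  return problems;
}
```

A check like this could run in the template's unit tests, so an edit that drops an approval gate fails before it reaches production.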

Pattern 2: Stateful + idempotent execution (Safe At-least-once rather than Exactly-once)

In a distributed environment, guaranteeing exactly-once execution is expensive. Instead, assume each step may execute at least once and make it idempotent, so that a repeated execution is safe.

type AgentState = {
  jobId: string;
  currentStepId: string | null;
  completedSteps: string[];
  retryCount: number;
  idempotencyKey: string;
  status: "PLANNED" | "EXECUTING" | "FAILED" | "COMPLETED";
};

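// stateStore and runTool are assumed to exist elsewhere; these signatures are illustrative.
declare const stateStore: {
  findResult(key: string, stepId: string): Promise<unknown>;
  saveResult(key: string, stepId: string, result: unknown): Promise<void>;
  markStepCompleted(jobId: string, stepId: string): Promise<void>;
};
declare function runTool(tool: PlanStep["tool"], opts: { timeoutMs: number }): Promise<unknown>;

export async function executeStep(state: AgentState, step: PlanStep) {
  // A stored result means this step already ran: return it instead of re-executing.
  const dedup = await stateStore.findResult(state.idempotencyKey, step.id);
  if (dedup != null) return dedup;

  const result = await runTool(step.tool, { timeoutMs: step.timeoutMs });
  // Persist the result first so a retry after a crash finds it and skips the tool call.
  await stateStore.saveResult(state.idempotencyKey, step.id, result);
  await stateStore.markStepCompleted(state.jobId, step.id);

  return result;
}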

Operating points:

  • Store checkpoints per step and resume from the last checkpoint after a restart.
  • Compose the idempotency key from jobId + stepId + tenantId.
  • When reprocessing after a failure, never re-execute a step that has already completed.
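
The key composition and the restart behavior described above can be sketched as two small pure functions. The ":" separator, the escaping, and the function names are assumptions; only the three key parts (tenantId, jobId, stepId) and the completedSteps checkpoint come from the article.

```typescript
// Builds the key from the three parts named in the article.
// Escaping each part keeps "a:b" + "c" from colliding with "a" + "b:c".
function buildIdempotencyKey(tenantId: string, jobId: string, stepId: string): string {
  return [tenantId, jobId, stepId].map((part) => encodeURIComponent(part)).join(":");
}

// Given the plan's ordered step ids and the checkpointed completed ids,
// returns the steps that still need to run after a restart, in plan order.
function remainingSteps(planStepIds: string[], completedSteps: string[]): string[] {
  const done = new Set(completedSteps);
  return planStepIds.filter((id) => !done.has(id));
}
```

After a restart, a worker would load completedSteps from the state store and only iterate over what remainingSteps returns, so completed steps are skipped rather than repeated.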

Pattern 3: Separate task queues and human-in-the-loop

Critical steps should be sent to an approval queue instead of running automatically. This is necessary not only for security but also for separation of operational responsibilities.

# Enqueue the risky step into the approval queue
./ops/agent/enqueue-review.sh \
  --job-id job_20260303_8821 \
  --step-id 3 \
  --reason "requiresApproval=true"

# Cap execution worker concurrency
./ops/agent/worker-start.sh \
  --queue agent_exec \
  --concurrency 20 \
  --max-retry 2 \
  --dead-letter agent_dlq

{
  "job_id": "job_20260303_8821",
  "state": "QUEUED_REVIEW",
  "plan_template": "onboarding_v1",
  "current_step": "createTicket",
  "risk_level": "medium",
  "requires_human_approval": true,
  "approved_by": null
}

Operating points:

  • Separate the auto-execution queue and approval queue to clarify responsibility and authority.
  • Prevent infinite retries by defining explicit DLQ (dead-letter queue) criteria.
  • Track the approval SLA as an operational metric to manage bottlenecks.
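
The queue separation above can be expressed as a routing decision made before each step runs. In this sketch, agent_exec and agent_dlq are the queue names from the worker command above and MAX_RETRY mirrors --max-retry 2; the agent_review queue name, the StepJob shape, and the exact retry semantics (one initial attempt plus MAX_RETRY retries) are assumptions.

```typescript
type QueueName = "agent_exec" | "agent_review" | "agent_dlq";

// Hypothetical job envelope for a single step.
type StepJob = {
  jobId: string;
  stepId: string;
  requiresApproval: boolean;
  approvedBy: string | null;
  failedAttempts: number;
};

const MAX_RETRY = 2; // assumed meaning: initial attempt + 2 retries

function routeStep(job: StepJob): QueueName {
  // Retry budget exhausted: park the job in the DLQ instead of retrying forever.
  if (job.failedAttempts > MAX_RETRY) return "agent_dlq";
  // Steps that need approval wait in the review queue until a human signs off.
  if (job.requiresApproval && job.approvedBy === null) return "agent_review";
  // Everything else runs automatically.
  return "agent_exec";
}
```

Because the decision is a pure function of the job record, the same rule can be applied by the enqueue path and re-checked by the worker before execution.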

Failure cases/anti-patterns

Failure scenario: “Agent infinite loop and queue backlog”

Situation:

  • The Planner over-applied the “replan if a tool fails” rule.
  • An external API returned a permission error (a permanent failure that retries cannot fix), yet the replan -> execute loop continued.
  • Within 30 minutes, queue length grew 10x and processing latency for new requests grew 4x.

Detection procedure:

  1. An alert fires on a surge in agent_replan_rate
  2. Confirm the repeating tool_error_code=PERMISSION_DENIED pattern
  3. Detect repeated state transition cycles (NEEDS_REPLAN <-> PLANNED) within the same jobId
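
Step 3 can be sketched as a scan over a job's recorded transitions. The NEEDS_REPLAN and PLANNED state names come from the scenario; the Transition record shape, the function names, and the default threshold are assumptions for illustration.

```typescript
// One recorded state change for a job (hypothetical log shape).
type Transition = { jobId: string; from: string; to: string };

// Counts NEEDS_REPLAN -> PLANNED edges per job, i.e. completed replan cycles.
function replanLoopCounts(history: Transition[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const t of history) {
    if (t.from === "NEEDS_REPLAN" && t.to === "PLANNED") {
      counts.set(t.jobId, (counts.get(t.jobId) ?? 0) + 1);
    }
  }
  return counts;
}

// Returns the job ids whose cycle count exceeds the threshold (assumed default: 1).
function jobsStuckInReplanLoop(history: Transition[], maxLoops = 1): string[] {
  return Array.from(replanLoopCounts(history).entries())
    .filter(([, count]) => count > maxLoops)
    .map(([jobId]) => jobId);
}
```

The per-job count is also a natural source for a loop-count metric on a dashboard.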

Mitigation procedure:

  1. Immediately classify permission errors as non-retriable
  2. Cap the number of replans (e.g. max 1)
  3. Force the affected jobs into the review queue
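
The three mitigation steps combine into one decision function. PERMISSION_DENIED is the error code from the scenario and the replan cap of 1 follows step 2; the other error codes, the retry budget, and the action names are hypothetical.

```typescript
// PERMISSION_DENIED is from the article; the other codes are illustrative.
type ToolErrorCode =
  | "PERMISSION_DENIED"
  | "POLICY_VIOLATION"
  | "SCHEMA_INVALID"
  | "TIMEOUT"
  | "RATE_LIMITED";

type NextAction = "RETRY" | "REPLAN" | "ESCALATE_TO_REVIEW";

// Errors that repeating the same call cannot fix.
const NON_RETRIABLE: ReadonlySet<ToolErrorCode> = new Set<ToolErrorCode>([
  "PERMISSION_DENIED",
  "POLICY_VIOLATION",
  "SCHEMA_INVALID",
]);

const MAX_RETRY = 2; // assumed retry budget per step
const MAX_REPLAN = 1; // cap from the mitigation list

function decideNextAction(
  code: ToolErrorCode,
  retryCount: number,
  replanCount: number,
): NextAction {
  // Permanent failures skip both retry and replan and go straight to a human.
  if (NON_RETRIABLE.has(code)) return "ESCALATE_TO_REVIEW";
  if (retryCount < MAX_RETRY) return "RETRY";
  if (replanCount < MAX_REPLAN) return "REPLAN";
  // Both budgets are spent: stop looping and escalate.
  return "ESCALATE_TO_REVIEW";
}
```

With this ordering, the replan -> execute loop from the scenario ends after at most one replan, and a permission error never reaches the Planner at all.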

Recovery procedure:

  1. Add cyclic transition detection rules to the state machine
  2. Specify “human escalation in case of authorization failure” in the plan template
  3. Add replan_loop_count indicator to operational dashboard

Representative antipatterns

  • Combining Planner and Executor into a single prompt call
  • Keeping progress only in in-memory objects, with no state persistence
  • Handling every tool failure as a model retry
  • Auto-executing steps of every risk level in the same queue

Checklist

  • Are the Planner and Executor separated at the logic/process level?
  • Are state machine transitions specified and is cycle/infinite loop detection possible?
  • Are step-level checkpoints and idempotent execution guaranteed?
  • Are non-retriable errors (permissions/policies/schema) excluded from retry?
  • Are the auto-execution queue and the human-review queue separate?
  • Are there DLQ criteria and recovery runbooks?
  • Are replan_rate, queue_depth, approval_sla monitored as operational indicators?

Summary

The key to making an agent operational is not model capabilities but system control. Having Planner/Executor separation, state machines, checkpoints, queues, and approval flows can help localize failures and reduce recovery time. Conversely, if a chain-based implementation is put into operation as is, problems arise first in reliability, not quality.

Next episode preview

The next part deals with productization. It connects failure UX, expectation management, human-in-the-loop experience design, and operational governance to explain how to turn LLM features that are “technically correct but fail for users” into a product.
