Part 12. Integrated reference architecture and final selection guide

Versions used

Java 21
Spring Boot 3.3.x
Spring Batch 5.2.x
Quartz 2.3.x
PostgreSQL 15
OpenSearch 2.x

1) Problem statement

In practice, failures usually begin not with the technology choice itself but with trying to solve every batch problem with a single tool. Scheduled settlement, re-indexing, operator-triggered reprocessing, and event correction each have different requirements, so there is no single correct answer. The answer is a combination.

This final part has two goals.

Quickly classify requirements and select an appropriate execution model.
Provide realistic criteria that weigh failure handling, performance, and operational complexity together.

2) Summary of key concepts

Integrated Selection Matrix

Requirement	First choice	Supplement	Remarks
Simple periodic task (1-2 instances)	`@Scheduled`	Idempotency key + execution history	Quick to start
Complex schedule/calendar/misfire handling	Quartz	JDBCJobStore + dashboard	Strong operational control
Bulk data conversion with restart required	Spring Batch	Keyset reader + chunk tuning	The standard choice
Operator-triggered reprocessing	Manual execution API/UI	Permissions/audit/rollback	Ensures control and reproducibility
Bulk search indexing	Spring Batch + OpenSearch Bulk	PIT + Search After	Respects search cluster limits
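
The matrix can also be read as a small decision function. The sketch below is only an illustration of that reading: the `Trait` and `ExecutionModel` enums and the `ExecutionModelSelector` class are hypothetical names that do not appear elsewhere in this series. It returns a set rather than a single value, because the point of the matrix is that one service often needs several models at once.

```java
import java.util.EnumSet;
import java.util.Set;

/** Hypothetical requirement traits, named after rows of the selection matrix. */
enum Trait { COMPLEX_SCHEDULE, BULK_RESTARTABLE, OPERATOR_TRIGGERED, SEARCH_INDEXING }

/** First-choice execution models from the selection matrix. */
enum ExecutionModel { SCHEDULED_ANNOTATION, QUARTZ, SPRING_BATCH, MANUAL_API, SPRING_BATCH_WITH_OPENSEARCH_BULK }

final class ExecutionModelSelector {

    private ExecutionModelSelector() {}

    /** Returns every first-choice model that the given traits call for. */
    static Set<ExecutionModel> select(Set<Trait> traits) {
        Set<ExecutionModel> models = EnumSet.noneOf(ExecutionModel.class);
        if (traits.contains(Trait.COMPLEX_SCHEDULE)) {
            models.add(ExecutionModel.QUARTZ);
        }
        if (traits.contains(Trait.BULK_RESTARTABLE)) {
            models.add(ExecutionModel.SPRING_BATCH);
        }
        if (traits.contains(Trait.OPERATOR_TRIGGERED)) {
            models.add(ExecutionModel.MANUAL_API);
        }
        if (traits.contains(Trait.SEARCH_INDEXING)) {
            models.add(ExecutionModel.SPRING_BATCH_WITH_OPENSEARCH_BULK);
        }
        // No special trait: treat it as a simple periodic task, where @Scheduled is enough.
        if (models.isEmpty()) {
            models.add(ExecutionModel.SCHEDULED_ANNOTATION);
        }
        return models;
    }
}
```

For example, a nightly settlement job that needs misfire handling and restartable bulk processing maps to both Quartz (as the trigger) and Spring Batch (as the processor).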

Principle of separation of execution layers

Trigger layer: @Scheduled/Quartz/K8s CronJob
Processing layer: Spring Batch Step/Chunk
Data layer: DB Keyset/Range + OpenSearch Bulk
Control layer: lock/leader election/idempotency
Operational layer: Observability/Alerts/DLQ/Manual Reprocessing

[Figure: reference architecture diagram]

3) Code examples

Example A: Unified Orchestrator Interface

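import java.time.LocalDateTime;
import java.util.Map;

import lombok.RequiredArgsConstructor;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.configuration.JobRegistry;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.stereotype.Service;

public interface BatchOrchestrator {
    Long launchScheduled(String jobName, LocalDateTime scheduledAt);
    Long launchManual(String jobName, Map<String, String> params, String requestedBy);
    Long launchRecovery(String jobName, Long failedExecutionId);
}

@Service
@RequiredArgsConstructor
public class DefaultBatchOrchestrator implements BatchOrchestrator {

    private final ExecutionGuard executionGuard;
    private final JobLauncher jobLauncher;
    // Assumption: jobs are looked up by name through a JobRegistry bean.
    private final JobRegistry jobRegistry;

    @Override
    public Long launchScheduled(String jobName, LocalDateTime scheduledAt) {
        // One dedup key per job and scheduled time, so a duplicate trigger is rejected.
        String dedupKey = jobName + ":" + scheduledAt;
        executionGuard.assertNotRunning(dedupKey);
        return run(jobName, Map.of("scheduledAt", scheduledAt.toString(), "dedupKey", dedupKey));
    }

    // launchManual and launchRecovery are omitted here; they would end in the same run(...) call.

    private Long run(String jobName, Map<String, String> params) {
        // Saving the execution history is omitted; this only launches the Spring Batch job.
        try {
            Job job = jobRegistry.getJob(jobName);
            JobParametersBuilder builder = new JobParametersBuilder();
            params.forEach(builder::addString);
            return jobLauncher.run(job, builder.toJobParameters()).getId();
        } catch (Exception e) {
            throw new IllegalStateException("Failed to launch job " + jobName, e);
        }
    }
}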
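
Example A relies on an `ExecutionGuard` whose implementation is not shown. As a rough idea of the contract behind `assertNotRunning`, here is a minimal in-memory sketch. The class name `InMemoryExecutionGuard` and the `release` method are assumptions made for illustration. It only protects a single JVM; with more than one instance, the same check would have to move to something shared, such as a unique constraint on the dedup key or a distributed lock.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical single-JVM guard that rejects a second launch with the same dedup key. */
final class InMemoryExecutionGuard {

    // Concurrent set, so the claim in assertNotRunning is atomic across threads.
    private final Set<String> running = ConcurrentHashMap.newKeySet();

    /** Claims the key, or throws if another execution already holds it. */
    void assertNotRunning(String dedupKey) {
        if (!running.add(dedupKey)) {
            throw new IllegalStateException("Already running: " + dedupKey);
        }
    }

    /** Frees the key once the execution has finished, whether it succeeded or failed. */
    void release(String dedupKey) {
        running.remove(dedupKey);
    }
}
```

Because the check and the claim happen in one `add` call, two threads that trigger the same schedule at the same moment cannot both pass the guard.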

Example B: SQL for Operations Dashboard

SELECT job_name,
       COUNT(*) FILTER (WHERE status = 'SUCCEEDED') AS success_count,
       COUNT(*) FILTER (WHERE status = 'FAILED') AS failed_count,
       AVG(duration_ms) AS avg_duration_ms,
       PERCENTILE_CONT(0.99) WITHIN GROUP (ORDER BY duration_ms) AS p99_duration_ms
FROM batch_job_execution
WHERE started_at >= NOW() - INTERVAL '24 hour'
GROUP BY job_name
ORDER BY failed_count DESC, p99_duration_ms DESC;

Example C: Keyset query for reprocessing

SELECT id, job_name, error_code, started_at
FROM batch_job_execution
WHERE status = 'FAILED'
  AND id > :last_id
ORDER BY id ASC
LIMIT 300;
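
The query in Example C is meant to be called repeatedly, feeding the last `id` of each page back in as `:last_id`. The loop below sketches that driver under a few assumptions: `FailedExecution` and `KeysetScanner` are hypothetical names, ids are assumed to be positive, and the page fetcher stands in for whatever data access code actually runs the SQL.

```java
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Consumer;

/** Hypothetical row type mirroring part of the SELECT list in Example C. */
record FailedExecution(long id, String jobName, String errorCode) {}

final class KeysetScanner {

    private KeysetScanner() {}

    /**
     * Walks all rows page by page.
     *
     * @param pageFetcher (lastId, limit) -> rows with id > lastId, ordered by id ascending
     * @return the number of rows passed to the handler
     */
    static int scan(BiFunction<Long, Integer, List<FailedExecution>> pageFetcher,
                    int pageSize,
                    Consumer<FailedExecution> handler) {
        long lastId = 0L; // start before the smallest id (assumes positive ids)
        int handled = 0;
        while (true) {
            List<FailedExecution> page = pageFetcher.apply(lastId, pageSize);
            for (FailedExecution row : page) {
                handler.accept(row);
                handled++;
            }
            // A short (or empty) page means there is nothing left to read.
            if (page.size() < pageSize) {
                return handled;
            }
            // Advance the cursor to the last id seen instead of using OFFSET.
            lastId = page.get(page.size() - 1).id();
        }
    }
}
```

Unlike OFFSET paging, each page is an index range scan starting at `lastId`, so the cost per page does not grow as the scan moves deeper into the table.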

Example D: OpenSearch Incremental Synchronization Query

POST /_search
{
  "size": 1000,
  "sort": [
    { "updated_at": "asc" },
    { "_id": "asc" }
  ],
  "search_after": ["2026-03-03T08:00:00Z", "product-9988"],
  "query": {
    "range": {
      "updated_at": { "gte": "2026-03-03T00:00:00Z" }
    }
  }
}
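
In Example D, the `search_after` values are the sort values of the last hit on the previous page, and the first page is sent without `search_after` at all. The sketch below shows only that cursor handling. `SyncCursor` and `IncrementalSyncQuery` are hypothetical names, and the body is built by plain string formatting on the assumption that the values contain no characters that need JSON escaping; a real client would use a JSON library or the OpenSearch client. PIT handling is left out.

```java
/** Hypothetical cursor: the sort values (updated_at, _id) of the last hit on a page. */
record SyncCursor(String updatedAt, String id) {}

final class IncrementalSyncQuery {

    private IncrementalSyncQuery() {}

    /**
     * Builds the request body for one page.
     *
     * @param cursor sort values of the previous page's last hit, or null for the first page
     */
    static String body(String fromInclusive, int size, SyncCursor cursor) {
        // The first page has no cursor, so the search_after clause is left out entirely.
        String searchAfter = cursor == null
                ? ""
                : "\n  \"search_after\": [\"%s\", \"%s\"],".formatted(cursor.updatedAt(), cursor.id());
        return """
                {
                  "size": %d,
                  "sort": [ { "updated_at": "asc" }, { "_id": "asc" } ],%s
                  "query": { "range": { "updated_at": { "gte": "%s" } } }
                }
                """.formatted(size, searchAfter, fromInclusive);
    }
}
```

The caller keeps requesting pages, replacing the cursor with the last hit's sort values each time, until a page comes back with fewer hits than `size`.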

4) Real-world failure/operational scenarios

Situation: During Black Friday traffic, order settlement (Quartz), product indexing (Spring Batch + OpenSearch), and calls to the operator reprocessing API all surged at the same time. The shared DB connection pool was exhausted, every batch job was delayed, and some failed with timeouts.

Cause:

Each batch job was tuned in isolation, so there was no model of system-wide limits.
No resource priority (Settlement > Reindex > Report) was defined.
Failed batches were retried immediately, which added to the load.

Improvements:

Assign a fixed resource budget (CPU/DB pool/IO) to each batch type.
Introduce priority queues and automatically defer low-priority jobs.
Retry failures with exponential backoff, and define a no-execution window during peak hours.
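
The retry improvement can be sketched as a small planner that answers one question: when may this failed batch run again? The class below is an illustration only. `BatchRetryPlanner` is a hypothetical name, the delays and window times are parameters rather than recommended values, and the window is assumed not to cross midnight.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.LocalTime;

/** Hypothetical helper: exponential backoff plus a daily no-execution window. */
final class BatchRetryPlanner {

    private final Duration baseDelay;
    private final Duration maxDelay;
    private final LocalTime windowStart; // inclusive
    private final LocalTime windowEnd;   // exclusive, later on the same day

    BatchRetryPlanner(Duration baseDelay, Duration maxDelay, LocalTime windowStart, LocalTime windowEnd) {
        if (baseDelay.isZero() || baseDelay.isNegative() || maxDelay.compareTo(baseDelay) < 0) {
            throw new IllegalArgumentException("need 0 < baseDelay <= maxDelay");
        }
        if (!windowStart.isBefore(windowEnd)) {
            throw new IllegalArgumentException("window must not cross midnight in this sketch");
        }
        this.baseDelay = baseDelay;
        this.maxDelay = maxDelay;
        this.windowStart = windowStart;
        this.windowEnd = windowEnd;
    }

    /** Delay before retry number {@code attempt} (1-based): baseDelay doubled each time, capped at maxDelay. */
    Duration backoff(int attempt) {
        if (attempt < 1) {
            throw new IllegalArgumentException("attempt must be >= 1");
        }
        Duration delay = baseDelay;
        // Stop doubling once the cap is reached, so large attempt numbers cannot overflow.
        for (int i = 1; i < attempt && delay.compareTo(maxDelay) < 0; i++) {
            delay = delay.multipliedBy(2);
        }
        return delay.compareTo(maxDelay) > 0 ? maxDelay : delay;
    }

    /** Earliest time the retry may run: after the backoff, pushed to the window end if it lands inside the window. */
    LocalDateTime nextAttemptAt(LocalDateTime failedAt, int attempt) {
        LocalDateTime candidate = failedAt.plus(backoff(attempt));
        LocalTime time = candidate.toLocalTime();
        boolean insideWindow = !time.isBefore(windowStart) && time.isBefore(windowEnd);
        return insideWindow ? candidate.toLocalDate().atTime(windowEnd) : candidate;
    }
}
```

With a 30-second base delay and a 10-minute cap, the delays run 30 s, 60 s, 120 s and so on until they flatten at 10 minutes. A retry that would land inside the peak window is held until the window closes, which keeps failed low-priority jobs from piling onto the busiest period.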

5) Design Checklist

Are batch requirements first classified as schedule/event/manual/bulk/NRT?
Is the execution trigger (@Scheduled/Quartz/K8s/Manual) clearly defined?
Is the DB read strategy designed around Keyset/Range?
Does OpenSearch access use PIT + Search After + Bulk control?
Are locks/leader election/idempotency/fencing tokens designed together?
Are DLQ and reprocessing operating procedures documented?
Are performance goals and failure recovery goals (RTO/RPO) managed numerically?

6) Summary

The conclusion of this series is simple: batch processing is a problem of the operating model, not of the technology. @Scheduled, Quartz, Spring Batch, and manual execution do not compete with one another; they divide the work between them. Combine them according to your requirements, failure model, and operational capability.

7) What's next

The series ends with this part. A follow-up article will cover the implementation repo structure and an observability dashboard template for running the four batch models side by side in a single service, based on an actual sample project.

Reference links

Previous post: Part 11. Failure Response Architecture: Partial Failure, Poison Data, DLQ, Retry, Idempotency
Next post: None (last part of this series)