
Part 3. Quartz cluster architecture: JobStore, Misfire, large-scale schedule management

This part covers JobStore selection, misfire policies, and cluster design standards for running Quartz as an operational control system rather than a simple scheduler.

Series: Mastering Spring Boot Batch Strategy

12 parts. You are currently reading Part 3.

Thumbnail: alarm clock and schedule

Source: Pexels - Black Alarm Clock on Desk

Versions used

  • Java 21
  • Spring Boot 3.3.x
  • Spring Batch 5.2.x
  • Quartz 2.3.x
  • PostgreSQL 15
  • OpenSearch 2.x

1) Problem statement

There comes a point where @Scheduled becomes hard to manage. Typical signs are:

  • The number of tasks grows into the hundreds or more.
  • Priorities, calendar exclusion dates, and retry policies differ from task to task.
  • The rules for compensating for missed executions after a failure start to matter.

At this point, Quartz becomes a “schedule state store + execution controller” rather than a simple cron tool. However, introducing Quartz does not automatically bring stability. If the JobStore and misfire policies are configured incorrectly, failures are amplified rather than contained.

2) Summary of key concepts

Select JobStore

JobStore | Advantages | Disadvantages | Recommended situation
RAMJobStore | Fast and simple | Schedules lost on restart; no clustering | Local development/testing
JDBCJobStore | Persistence, cluster support, audit trail | DB load; schema management required | Default choice for production

In production environments, JDBCJobStore is the de facto standard. In cluster mode, all Quartz instances share the same QRTZ_* tables and use database row locks to coordinate which instance fires each trigger.

Misfire Policy

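A misfire is a trigger that failed to fire at its scheduled time. Quartz treats a trigger as misfired once it is late by more than org.quartz.jobStore.misfireThreshold (60 seconds by default). Choosing the wrong policy can unleash a burst of executions during disaster recovery.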

  • MISFIRE_INSTRUCTION_FIRE_NOW: Fire immediately. This is a SimpleTrigger instruction; the CronTrigger counterpart is MISFIRE_INSTRUCTION_FIRE_ONCE_NOW. Missed work is caught up quickly, but a sudden load spike is possible.
  • MISFIRE_INSTRUCTION_DO_NOTHING: Skip the missed run and wait for the next scheduled time. Safety takes precedence over consistency.
  • Custom reschedule: Cap the amount of catch-up according to domain importance.
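
The three options above can be sketched as a small planning function. This is plain Java, not the Quartz API: the class, enum, and method names are hypothetical, and the "limited" cap is an assumed way to express "limit correction amount by domain importance". In Quartz itself the per-trigger choice is made on the schedule builder, for example withMisfireHandlingInstructionDoNothing() on CronScheduleBuilder.

```java
final class MisfirePlanSketch {

    // Hypothetical per-domain policy, loosely mirroring the three options above.
    enum Policy {
        FIRE_NOW,    // every misfired trigger fires right after recovery
        DO_NOTHING,  // every misfired trigger waits for its next scheduled time
        LIMITED      // at most `cap` triggers fire immediately, the rest wait
    }

    // How many misfired triggers fire immediately and how many are deferred.
    record Plan(int fireNow, int deferred) {}

    static Plan plan(Policy policy, int misfired, int cap) {
        int total = Math.max(misfired, 0);
        int fireNow = switch (policy) {
            case FIRE_NOW -> total;
            case DO_NOTHING -> 0;
            case LIMITED -> Math.min(total, Math.max(cap, 0));
        };
        return new Plan(fireNow, total - fireNow);
    }

    public static void main(String[] args) {
        // Assumed example numbers: 1,200 misfired triggers, a cap of 50 for the limited profile.
        System.out.println("FIRE_NOW   -> " + plan(Policy.FIRE_NOW, 1200, 0));
        System.out.println("DO_NOTHING -> " + plan(Policy.DO_NOTHING, 1200, 0));
        System.out.println("LIMITED    -> " + plan(Policy.LIMITED, 1200, 50));
    }
}
```

The point of the sketch is that the immediate load after recovery is a number you can decide in advance per domain, rather than whatever the backlog happens to be.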

Cluster diagram

[Mermaid diagram: Quartz cluster (not rendered)]

Body image: data center operator

Source: Pexels - Engineer beside server racks

3) Code examples

Example A: Quartz cluster setup (Spring Boot)

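spring:
  quartz:
    job-store-type: jdbc
    properties:
      # Must be identical on every node that belongs to the same cluster
      org.quartz.scheduler.instanceName: batchScheduler
      # Generates a unique id per node
      org.quartz.scheduler.instanceId: AUTO
      # With a Spring-managed DataSource, Spring's LocalDataSourceJobStore is the safer choice;
      # setting JobStoreTX here can fail at startup with "DataSource name not set"
      org.quartz.jobStore.class: org.springframework.scheduling.quartz.LocalDataSourceJobStore
      org.quartz.jobStore.driverDelegateClass: org.quartz.impl.jdbcjobstore.PostgreSQLDelegate
      org.quartz.jobStore.isClustered: true
      # Milliseconds between node check-ins; affects how quickly a failed node is detected
      org.quartz.jobStore.clusterCheckinInterval: 15000
      # Keep in line with the DB pool and other downstream limits
      org.quartz.threadPool.threadCount: 20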

Example B: Job that disallows concurrent execution

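import org.quartz.DisallowConcurrentExecution;
import org.quartz.Job;
import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;

// Applies per JobDetail (job key): executions of the same JobDetail never overlap,
// but separate JobDetails built from this class can still run in parallel.
@DisallowConcurrentExecution
public class ReindexJob implements Job {

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        String tenantId = context.getMergedJobDataMap().getString("tenantId");
        try {
            // Actual reindexing logic
            reindexTenant(tenantId);
        } catch (Exception ex) {
            // refireImmediately = true re-runs the job right away. A job that always fails
            // will loop, so cap the retries or pass false and reschedule with a delay.
            throw new JobExecutionException(ex, true);
        }
    }

    private void reindexTenant(String tenantId) {
        // Run in slices based on id ranges or search-after
    }
}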

Example C: Misfire monitoring SQL

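-- Waiting triggers whose next fire time is already past the misfire threshold.
-- 60000 ms is the Quartz default for org.quartz.jobStore.misfireThreshold; adjust if overridden.
SELECT trigger_name,
       trigger_group,
       trigger_state,
       next_fire_time,
       prev_fire_time,
       misfire_instr
FROM qrtz_triggers
WHERE trigger_state = 'WAITING'
  AND next_fire_time < (EXTRACT(EPOCH FROM NOW()) * 1000 - 60000)
ORDER BY next_fire_time ASC
LIMIT 200;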

Example D: Quartz execution history keyset query

-- Keyset pagination: pass the largest id of the previous page as :last_id (0 for the first page).
SELECT id, scheduler_name, job_name, status, started_at
FROM batch_job_execution
WHERE scheduler_name = 'quartz'
  AND id > :last_id
ORDER BY id ASC
LIMIT 500;
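
A minimal sketch of how a caller might drive the keyset query above, page by page. The `fetchPage` function stands in for the actual SQL call and is an assumption, as are the class, record, and method names; only the `id > :last_id` and `ORDER BY id ASC` contract comes from the query.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

final class KeysetPagingSketch {

    // Hypothetical projection of one history row; only id matters for paging.
    record Row(long id, String jobName) {}

    /**
     * Reads all rows by repeatedly asking for "id > lastId ORDER BY id ASC LIMIT pageSize".
     * fetchPage is a stand-in for the real query: (lastId, pageSize) -> next page.
     */
    static List<Row> readAll(BiFunction<Long, Integer, List<Row>> fetchPage, int pageSize) {
        List<Row> all = new ArrayList<>();
        long lastId = 0L; // assumes ids are positive, so 0 means "from the start"
        while (true) {
            List<Row> page = fetchPage.apply(lastId, pageSize);
            if (page.isEmpty()) {
                break;
            }
            all.addAll(page);
            // The last row of an ascending page carries the largest id seen so far.
            lastId = page.get(page.size() - 1).id();
            if (page.size() < pageSize) {
                break; // a short page means there is nothing left
            }
        }
        return all;
    }
}
```

Unlike OFFSET-based paging, each call here seeks directly to the next id, so the cost per page stays roughly constant as the history table grows.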

4) Real-world failure and operations scenario

Situation: Quartz was down for 25 minutes during DB maintenance, and 1,200 misfired triggers had accumulated by the time it was restored. Because every trigger used FIRE_NOW, concurrent executions spiked immediately after recovery, the OLTP DB connection pool was exhausted, and API latency rose.

Analysis:

  • The same misfire policy was applied to critical tasks (settlement) and less critical tasks (report generation).
  • threadCount had been raised to 50 without accounting for the downstream limit (a DB pool of 30).
  • Standardizing the transaction isolation level on REPEATABLE READ had lengthened lock hold times.
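
The mismatch between the thread count and the pool size can be checked with simple arithmetic. The sketch below assumes each running job holds one DB connection and that some connections are reserved for API traffic. The reserve of 10 is a made-up number for illustration, and all names are hypothetical.

```java
final class CapacityCheckSketch {

    // Connections left for batch jobs after reserving some for API traffic.
    static int dbBudget(int dbPoolSize, int reservedForApi) {
        return Math.max(dbPoolSize - reservedForApi, 0);
    }

    // The tightest downstream limit bounds how many jobs can usefully run at once.
    static int effectiveConcurrency(int quartzThreads, int dbPoolSize, int reservedForApi) {
        return Math.min(quartzThreads, dbBudget(dbPoolSize, reservedForApi));
    }

    // True when Quartz may start more jobs than the DB pool can serve.
    static boolean oversubscribed(int quartzThreads, int dbPoolSize, int reservedForApi) {
        return quartzThreads > dbBudget(dbPoolSize, reservedForApi);
    }

    public static void main(String[] args) {
        // Numbers from the scenario: threadCount=50, DB pool 30; the reserve of 10 is assumed.
        System.out.println("effective concurrency: " + effectiveConcurrency(50, 30, 10));
        System.out.println("oversubscribed: " + oversubscribed(50, 30, 10));
    }
}
```

With these inputs, 50 threads compete for 20 usable connections, which is consistent with the pool exhaustion described in the scenario.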

Improvements:

  1. Separate misfire profiles by domain: limited FIRE_NOW for settlement, DO_NOTHING for reporting.
  2. Size Quartz execution threads, the DB pool, and external API QPS against the same limit model.
  3. Spread failure retries out with exponential backoff instead of retrying immediately.
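
For the third improvement, a retry delay with exponential backoff might be computed as below. This is a sketch under stated assumptions: the base delay, the cap, and the full-jitter variant are illustrative choices, not values from the incident, and the class and method names are hypothetical. How the delay is applied (for example, by rescheduling the trigger) is left out.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

final class BackoffSketch {

    /**
     * Delay before retry number `attempt` (1-based): base, 2*base, 4*base, ... capped at max.
     */
    static Duration backoff(int attempt, Duration base, Duration max) {
        if (base.isNegative() || base.isZero() || max.compareTo(base) < 0) {
            throw new IllegalArgumentException("require 0 < base <= max");
        }
        long delay = base.toMillis();
        long cap = max.toMillis();
        // Double once per previous attempt, stopping as soon as the cap is reached.
        for (int i = 1; i < attempt && delay < cap; i++) {
            delay *= 2;
        }
        return Duration.ofMillis(Math.min(delay, cap));
    }

    // Full jitter: a random delay in [0, d] so retries from many jobs do not align.
    static Duration withFullJitter(Duration d) {
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(d.toMillis() + 1));
    }

    public static void main(String[] args) {
        Duration base = Duration.ofSeconds(1);
        Duration max = Duration.ofSeconds(60);
        for (int attempt = 1; attempt <= 8; attempt++) {
            System.out.println("attempt " + attempt + ": up to " + backoff(attempt, base, max));
        }
    }
}
```

Jitter matters here because many jobs failing at the same moment would otherwise retry at the same moment too, recreating the spike the backoff was meant to avoid.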

5) Design Checklist

  • Is RAMJobStore kept out of the production environment?
  • Are misfire policies separated by business importance?
  • Have you tuned Quartz threadCount together with DB and external system capacity?
  • Has the scope of @DisallowConcurrentExecution been reviewed?
  • Are the cluster check-in interval and the failure detection time defined as concrete numbers?
  • Are the execution history and the misfire backlog monitored on a dashboard?

6) Summary

Quartz is a powerful option when you need “fine-grained schedule control.” However, failures can be amplified if JobStore, Misfire, and thread settings are not aligned with business consistency and infrastructure limits.

7) Next episode preview

In the next episode, we will cover Spring Batch's chunk processing model in depth: Reader/Processor/Writer transaction boundaries, restart strategies, and why Spring Batch is the standard for bulk data processing.
