
Part 9. Distributed environment deployment: Leader Election, Kubernetes CronJob, and lock strategy comparison

We present criteria for choosing the batch execution entity and lock strategy so that multi-instance deployments avoid duplicate batch execution while keeping operational complexity under control.

Series: Mastering Spring Boot Batch Strategies

A 12-part series. You are reading Part 9.

[Thumbnail: distributed container environment]

Source: Pexels - Row of blue shipping containers

Versions used

  • Java 21
  • Spring Boot 3.3.x
  • Spring Batch 5.2.x
  • Quartz 2.3.x
  • PostgreSQL 15
  • OpenSearch 2.x

1) The problem

The most common batch failure in a distributed environment is duplicate execution. The moment an application scales to 10 pods, a single batch trigger can fire 10 times. This problem arises not from technology selection, but because the executing agent is not clearly defined.

There are two key questions.

  • Who triggers the batch? (Kubernetes vs. inside the application)
  • Where do we enforce the guarantee that only one instance runs at a time? (lock/leader election)
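To make the failure mode concrete, here is a minimal, self-contained simulation (plain Java, no Spring; the class name is hypothetical): each of 10 pods fires on the same scheduled tick, and only a shared lock, stood in for here by an AtomicBoolean, collapses the 10 triggers into one execution.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class DuplicateTriggerDemo {

    /** Simulates one cron tick across N pods; returns {uncoordinated, coordinated} counts. */
    public static int[] simulate(int pods) {
        // Without coordination: every pod's in-app scheduler fires on the same tick.
        AtomicInteger uncoordinated = new AtomicInteger();
        for (int pod = 0; pod < pods; pod++) {
            uncoordinated.incrementAndGet(); // each pod runs the batch
        }

        // With a shared lock (stand-in for a DB/Redis lock): only the first acquirer runs.
        AtomicBoolean lock = new AtomicBoolean(false);
        AtomicInteger coordinated = new AtomicInteger();
        for (int pod = 0; pod < pods; pod++) {
            if (lock.compareAndSet(false, true)) {
                coordinated.incrementAndGet();
            }
        }
        return new int[] {uncoordinated.get(), coordinated.get()};
    }

    public static void main(String[] args) {
        int[] result = simulate(10);
        System.out.println(result[0] + " executions without a lock, " + result[1] + " with one");
    }
}
```

The real lock lives in the database, Redis, or a K8s Lease rather than in process memory, but the shape of the guarantee is the same.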

2) Summary of key concepts

Kubernetes CronJob vs in-app scheduler

| Criteria | Kubernetes CronJob | In-app scheduler (@Scheduled/Quartz) |
| --- | --- | --- |
| Execution entity | Platform (K8s) | Application |
| Deployment/restart independence | High | Tightly coupled to app lifecycle |
| Code proximity | Low (separate Job pod) | High (same codebase) |
| Operational dependency | K8s | App code and lock design |
| Recommended situation | Simple, independent batches; infrastructure standardization | Batches closely tied to domain logic |

Comparison of lock/leader election strategies

| Strategy | Advantages | Disadvantages | Recommended situation |
| --- | --- | --- | --- |
| DB lock | Strong consistency, no extra infrastructure | Increased DB contention | Low trigger frequency, DB-centric systems |
| Redis lock | Fast and scalable | TTL and split-brain considerations | High-frequency jobs |
| ZooKeeper/etcd election | Robust leader election | High operational complexity | Large platform teams |
| K8s Lease | Kubernetes-native | K8s dependency | Standardized K8s environments |

Distributed execution control diagram

[Diagram: distributed execution control]

[Image: container port operations]

Source: Pexels - Container ships at cargo port

3) Code example

Example A: Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: settlement-job
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: settlement
              image: my-registry/batch:1.0.0
              args: ["--job=settlement"]

Example B: DB-based leader lock

CREATE TABLE batch_leader_lock (
    lock_name VARCHAR(100) PRIMARY KEY,
    holder_id VARCHAR(100) NOT NULL,
    expires_at TIMESTAMP NOT NULL
);
-- Attempt to acquire the leader lock (succeeds only if the row is absent or expired)
INSERT INTO batch_leader_lock (lock_name, holder_id, expires_at)
VALUES ('settlement', :holder_id, NOW() + INTERVAL '30 second')
ON CONFLICT (lock_name)
DO UPDATE SET holder_id = EXCLUDED.holder_id,
              expires_at = EXCLUDED.expires_at
WHERE batch_leader_lock.expires_at < NOW();

Example C: Check and execute lock in Spring service

public void runIfLeader() {
    // Try to become leader for this tick; the repository backs onto the
    // batch_leader_lock table shown in Example B.
    boolean acquired = leaderLockRepository.tryAcquire("settlement", instanceId, Duration.ofSeconds(30));
    if (!acquired) {
        return; // another instance is the leader; skip this tick
    }
    try {
        settlementService.execute();
    } finally {
        // Release only our own lock so a successor's lock is never deleted.
        leaderLockRepository.release("settlement", instanceId);
    }
}
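The leaderLockRepository above is left abstract. A minimal in-memory sketch of its contract (hypothetical class name, mirroring the acquire-if-absent-or-expired semantics of the SQL in Example B) might look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the DB-backed repository; for illustration only.
public class InMemoryLeaderLockRepository {
    private record Lease(String holderId, Instant expiresAt) {}
    private final Map<String, Lease> locks = new ConcurrentHashMap<>();

    /** Acquire succeeds if the lock is absent, expired, or already ours. */
    public boolean tryAcquire(String lockName, String holderId, Duration ttl) {
        Instant now = Instant.now();
        Lease result = locks.compute(lockName, (name, current) -> {
            boolean free = current == null
                    || current.expiresAt().isBefore(now)
                    || current.holderId().equals(holderId);
            return free ? new Lease(holderId, now.plus(ttl)) : current;
        });
        return result.holderId().equals(holderId);
    }

    /** Only the current holder may release, so a successor's lock is never deleted. */
    public void release(String lockName, String holderId) {
        locks.computeIfPresent(lockName,
                (name, lease) -> lease.holderId().equals(holderId) ? null : lease);
    }
}
```

The holder check in release matters: if our lock expired and another instance took over, a blind delete would destroy the new leader's lock.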

Example D: Execution history keyset query

SELECT id, job_name, instance_id, status, started_at
FROM batch_job_execution
WHERE job_name = 'settlement'
  AND id > :last_id
ORDER BY id ASC
LIMIT 200;
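The keyset query above would be driven by a loop like the following sketch (hypothetical class, simulated over an in-memory id list rather than a real table): each page filters on id > :last_id and the cursor advances to the last id returned, so there is no offset scan and no duplication between pages.

```java
import java.util.List;

public class KeysetPager {

    /** Pages through ids keyset-style; returns how many pages were fetched. */
    public static int countPages(List<Long> ids, int pageSize) {
        long lastId = 0; // corresponds to :last_id; 0 means "before the first row"
        int pages = 0;
        while (true) {
            final long cursor = lastId;
            // Stand-in for: WHERE id > :last_id ORDER BY id ASC LIMIT :pageSize
            List<Long> page = ids.stream()
                    .filter(id -> id > cursor)
                    .sorted()
                    .limit(pageSize)
                    .toList();
            if (page.isEmpty()) {
                break; // no rows past the cursor: done
            }
            pages++;
            lastId = page.get(page.size() - 1); // next page resumes after this id
        }
        return pages;
    }
}
```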

4) Real-world failure/operational scenarios

Situation: a Redis lock renewal failed during a network partition, and two instances each concluded they were the leader and ran the batch simultaneously (split-brain).

Cause:

  • Only lock acquisition was implemented; there was no fencing token.
  • The downstream writer did not enforce the rule that only the latest token is valid.
  • The lock TTL was short, so temporary delays frequently triggered re-election.

Improvements:

  1. Issue a monotonically increasing fencing token with each leader lock grant.
  2. Verify the token in the writer and reject any operation carrying an older token.
  3. Size the lock TTL and heartbeat interval from the P99 network delay.
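Steps 1 and 2 can be sketched in plain Java (hypothetical FencedWriter class): the writer tracks the highest fencing token it has seen and rejects any operation carrying an older one, so a deposed leader holding a stale token can never overwrite newer work.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FencedWriter {
    // Highest fencing token observed so far; tokens are issued monotonically
    // by the lock service, one per leadership grant.
    private final AtomicLong highestSeenToken = new AtomicLong(-1);

    /** Returns true if the write was accepted, false if fenced off as stale. */
    public boolean write(long fencingToken, Runnable operation) {
        long seen = highestSeenToken.get();
        while (fencingToken >= seen) {
            if (highestSeenToken.compareAndSet(seen, fencingToken)) {
                operation.run(); // a deposed leader's stale token never reaches here
                return true;
            }
            seen = highestSeenToken.get(); // lost a race; re-check against the new maximum
        }
        return false; // token is older than one already seen: reject
    }
}
```

In production the token would be stored alongside the data (or checked inside the same transaction) rather than in process memory, but the acceptance rule is the same.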

5) Design Checklist

  • Has the execution entity been clearly defined as either K8s CronJob or within the app?
  • Is concurrent execution blocked by concurrencyPolicy: Forbid or by a distributed lock?
  • Are fencing tokens applied to leader election?
  • Is there an interruption and recovery procedure in case of lock expiration/renewal failure?
  • Do you track execution history and leader change history?
  • Did you select a locking technology (DB/Redis/Zookeeper) that suits the platform team capabilities?

6) Summary

The core of distributed batch design is not "execute exactly once," but a structure that is "safe even if duplicate execution occurs." Operational risk is controlled by designing the execution entity, locking, idempotency, and fencing tokens together.

7) Next episode preview

The next part covers performance optimization: batch size, commit interval, JVM memory sizing, and backpressure design, explained with concrete numerical models.
