
Part 9. Distributed environment deployment: Leader Election, Kubernetes CronJob, and lock strategy comparison

We present criteria for choosing the batch execution entity and lock strategy so that multi-instance deployments avoid duplicate batch execution while keeping operational complexity under control.

Series: Mastering Spring Boot Batch Strategies

A 12-part series. You are reading Part 9.

[Thumbnail: distributed container environment]

Source: Pexels - Row of blue shipping containers

Versions used

  • Java 21
  • Spring Boot 3.3.x
  • Spring Batch 5.2.x
  • Quartz 2.3.x
  • PostgreSQL 15
  • OpenSearch 2.x

1) The problem

The most common batch failure in a distributed environment is duplicate execution. The moment an application scales to 10 pods, a single batch trigger can fire 10 times. This problem arises not from technology selection, but because the executing agent is not clearly defined.

There are two key questions.

  • Who triggers the batch? (Kubernetes vs. inside the application)
  • Where do we enforce the guarantee that only one instance runs at a time? (lock/leader election)
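To make the failure mode concrete, here is a minimal, self-contained simulation (plain Java, no Spring; the class name is hypothetical): each of 10 pods fires on the same scheduled tick, and only a shared lock, stood in for here by an AtomicBoolean, collapses the 10 triggers into one execution.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

public class DuplicateTriggerDemo {

    /** Simulates one cron tick across N pods; returns {uncoordinated, coordinated} counts. */
    public static int[] simulate(int pods) {
        // Without coordination: every pod's in-app scheduler fires on the same tick.
        AtomicInteger uncoordinated = new AtomicInteger();
        for (int pod = 0; pod < pods; pod++) {
            uncoordinated.incrementAndGet(); // each pod runs the batch
        }

        // With a shared lock (stand-in for a DB/Redis lock): only the first acquirer runs.
        AtomicBoolean lock = new AtomicBoolean(false);
        AtomicInteger coordinated = new AtomicInteger();
        for (int pod = 0; pod < pods; pod++) {
            if (lock.compareAndSet(false, true)) {
                coordinated.incrementAndGet();
            }
        }
        return new int[] {uncoordinated.get(), coordinated.get()};
    }

    public static void main(String[] args) {
        int[] result = simulate(10);
        System.out.println(result[0] + " executions without a lock, " + result[1] + " with one");
    }
}
```

The real lock lives in the database, Redis, or a K8s Lease rather than in process memory, but the shape of the guarantee is the same.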

2) Summary of key concepts

Kubernetes CronJob vs in-app scheduler

| Criteria | Kubernetes CronJob | In-app scheduler (@Scheduled/Quartz) |
| --- | --- | --- |
| Execution entity | Platform (K8s) | Application |
| Deployment/restart independence | High | Tightly coupled to app lifecycle |
| Code proximity | Low (separate Job pod) | High (same codebase) |
| Operational dependency | K8s | App code and lock design |
| Recommended situation | Simple, independent batches; infrastructure standardization | Batches closely tied to domain logic |

Comparison of lock/leader election strategies

| Strategy | Advantages | Disadvantages | Recommended situation |
| --- | --- | --- | --- |
| DB lock | Strong consistency, no extra infrastructure | Increased DB contention | Low trigger frequency, DB-centric systems |
| Redis lock | Fast and scalable | TTL and split-brain considerations | High-frequency jobs |
| ZooKeeper/etcd election | Robust leader election | High operational complexity | Large platform teams |
| K8s Lease | Kubernetes-native | K8s dependency | Standardized K8s environments |

Distributed execution control diagram

[Diagram: distributed execution control]

[Image: container port operations]

Source: Pexels - Container ships at cargo port

3) Code example

Example A: Kubernetes CronJob

apiVersion: batch/v1
kind: CronJob
metadata:
  name: settlement-job
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 5
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: settlement
              image: my-registry/batch:1.0.0
              args: ["--job=settlement"]

Example B: DB-based leader lock

CREATE TABLE batch_leader_lock (
    lock_name VARCHAR(100) PRIMARY KEY,
    holder_id VARCHAR(100) NOT NULL,
    expires_at TIMESTAMP NOT NULL
);
-- Attempt to acquire the leader lock (succeeds only if the row is absent or expired)
INSERT INTO batch_leader_lock (lock_name, holder_id, expires_at)
VALUES ('settlement', :holder_id, NOW() + INTERVAL '30 second')
ON CONFLICT (lock_name)
DO UPDATE SET holder_id = EXCLUDED.holder_id,
              expires_at = EXCLUDED.expires_at
WHERE batch_leader_lock.expires_at < NOW();

Example C: Check and execute lock in Spring service

public void runIfLeader() {
    // Try to become leader for this tick; the repository backs onto the
    // batch_leader_lock table shown in Example B.
    boolean acquired = leaderLockRepository.tryAcquire("settlement", instanceId, Duration.ofSeconds(30));
    if (!acquired) {
        return; // another instance is the leader; skip this tick
    }
    try {
        settlementService.execute();
    } finally {
        // Release only our own lock so a successor's lock is never deleted.
        leaderLockRepository.release("settlement", instanceId);
    }
}
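The leaderLockRepository above is left abstract. A minimal in-memory sketch of its contract (hypothetical class name, mirroring the acquire-if-absent-or-expired semantics of the SQL in Example B) might look like this:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory stand-in for the DB-backed repository; for illustration only.
public class InMemoryLeaderLockRepository {
    private record Lease(String holderId, Instant expiresAt) {}
    private final Map<String, Lease> locks = new ConcurrentHashMap<>();

    /** Acquire succeeds if the lock is absent, expired, or already ours. */
    public boolean tryAcquire(String lockName, String holderId, Duration ttl) {
        Instant now = Instant.now();
        Lease result = locks.compute(lockName, (name, current) -> {
            boolean free = current == null
                    || current.expiresAt().isBefore(now)
                    || current.holderId().equals(holderId);
            return free ? new Lease(holderId, now.plus(ttl)) : current;
        });
        return result.holderId().equals(holderId);
    }

    /** Only the current holder may release, so a successor's lock is never deleted. */
    public void release(String lockName, String holderId) {
        locks.computeIfPresent(lockName,
                (name, lease) -> lease.holderId().equals(holderId) ? null : lease);
    }
}
```

The holder check in release matters: if our lock expired and another instance took over, a blind delete would destroy the new leader's lock.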

Example D: Execution history keyset query

SELECT id, job_name, instance_id, status, started_at
FROM batch_job_execution
WHERE job_name = 'settlement'
  AND id > :last_id
ORDER BY id ASC
LIMIT 200;
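The keyset query above would be driven by a loop like the following sketch (hypothetical class, simulated over an in-memory id list rather than a real table): each page filters on id > :last_id and the cursor advances to the last id returned, so there is no offset scan and no duplication between pages.

```java
import java.util.List;

public class KeysetPager {

    /** Pages through ids keyset-style; returns how many pages were fetched. */
    public static int countPages(List<Long> ids, int pageSize) {
        long lastId = 0; // corresponds to :last_id; 0 means "before the first row"
        int pages = 0;
        while (true) {
            final long cursor = lastId;
            // Stand-in for: WHERE id > :last_id ORDER BY id ASC LIMIT :pageSize
            List<Long> page = ids.stream()
                    .filter(id -> id > cursor)
                    .sorted()
                    .limit(pageSize)
                    .toList();
            if (page.isEmpty()) {
                break; // no rows past the cursor: done
            }
            pages++;
            lastId = page.get(page.size() - 1); // next page resumes after this id
        }
        return pages;
    }
}
```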

4) Real-world failure/operational scenarios

Situation: a Redis lock renewal failed during a network partition, and two instances each concluded they were the leader and ran the batch simultaneously (split-brain).

Cause:

  • Only lock acquisition was implemented; there was no fencing token.
  • The downstream writer did not enforce the rule that only the latest token is valid.
  • The lock TTL was short, so temporary delays frequently triggered re-election.

Improvements:

  1. Issue a monotonically increasing fencing token with each leader lock grant.
  2. Verify the token in the writer and reject any operation carrying an older token.
  3. Size the lock TTL and heartbeat interval from the P99 network delay.
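Steps 1 and 2 can be sketched in plain Java (hypothetical FencedWriter class): the writer tracks the highest fencing token it has seen and rejects any operation carrying an older one, so a deposed leader holding a stale token can never overwrite newer work.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FencedWriter {
    // Highest fencing token observed so far; tokens are issued monotonically
    // by the lock service, one per leadership grant.
    private final AtomicLong highestSeenToken = new AtomicLong(-1);

    /** Returns true if the write was accepted, false if fenced off as stale. */
    public boolean write(long fencingToken, Runnable operation) {
        long seen = highestSeenToken.get();
        while (fencingToken >= seen) {
            if (highestSeenToken.compareAndSet(seen, fencingToken)) {
                operation.run(); // a deposed leader's stale token never reaches here
                return true;
            }
            seen = highestSeenToken.get(); // lost a race; re-check against the new maximum
        }
        return false; // token is older than one already seen: reject
    }
}
```

In production the token would be stored alongside the data (or checked inside the same transaction) rather than in process memory, but the acceptance rule is the same.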

5) Design Checklist

  • Has the execution entity been clearly defined as either K8s CronJob or within the app?
  • Is concurrent execution blocked by concurrencyPolicy: Forbid or by a distributed lock?
  • Are fencing tokens applied to leader election?
  • Is there an interruption and recovery procedure in case of lock expiration/renewal failure?
  • Do you track execution history and leader change history?
  • Did you select a locking technology (DB/Redis/Zookeeper) that suits the platform team capabilities?

6) Summary

The core of distributed batch design is not "execute exactly once," but a structure that is "safe even if duplicate execution occurs." Operational risk is controlled by designing the execution entity, locking, idempotency, and fencing tokens together.

7) Next episode preview

The next part covers performance optimization: batch size, commit interval, JVM memory sizing, and backpressure design, explained with concrete numerical models.
