Part 4. Spring Batch core: chunks, transaction boundaries, and restartable job design

Versions used

Java 21
Spring Boot 3.3.x
Spring Batch 5.2.x
Quartz 2.3.x
PostgreSQL 15
OpenSearch 2.x

1) Problem statement

The main reason to adopt Spring Batch is not “mass processing” itself but “a structure that can be safely rerun after a failure.” In production, serious failures come mostly from the following two problems rather than from a lack of performance:

Not knowing how far processing has been committed.
Items being processed twice or missed when the job is restarted.

In other words, the key is not speed but boundaries. Reader, Processor, and Writer are split less to separate concerns than to make transaction boundaries explicit.

2) Summary of key concepts

Chunk processing model

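The basic unit of work in Spring Batch is the “chunk”.

The Reader reads items one at a time until it has N of them.
The Processor transforms and validates each item.
The Writer writes the whole chunk in one call.
The transaction is committed once per chunk.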
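The four steps above can be modelled in a few lines of plain Java. This is a simplified sketch for illustration, not Spring Batch's actual implementation (no skip, retry, or listeners); ChunkLoopSketch and its parameters are hypothetical names. As with a real ItemProcessor, returning null filters an item out.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;

class ChunkLoopSketch {
    // Plain-Java model of a chunk-oriented step.
    // Returns the number of items handed to the writer for each committed chunk.
    static <I, O> List<Integer> run(Iterator<I> reader, Function<I, O> processor,
                                    Consumer<List<O>> writer, int chunkSize) {
        List<Integer> committedChunkSizes = new ArrayList<>();
        while (reader.hasNext()) {
            // 1) "begin transaction" for this chunk
            List<O> outputs = new ArrayList<>(chunkSize);
            // 2) read and process up to chunkSize items, one at a time
            for (int i = 0; i < chunkSize && reader.hasNext(); i++) {
                O out = processor.apply(reader.next());
                if (out != null) {
                    outputs.add(out); // null means "filtered", as with ItemProcessor
                }
            }
            // 3) write the whole chunk in a single call
            writer.accept(outputs);
            // 4) "commit": everything above succeeds or fails as one unit
            committedChunkSizes.add(outputs.size());
        }
        return committedChunkSizes;
    }
}
```

With 10 items and a chunk size of 4, the writer is called three times with 4, 4, and 2 items, so there are three commits.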

Larger chunks mean fewer commits and DB round trips, but a failure rolls back more work and each chunk holds more memory. Conversely, if chunks are too small, per-commit overhead dominates.
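The trade-off is simple arithmetic. A minimal sketch, assuming a run of a known total size: the number of commits is the total divided by the chunk size (rounded up), and the worst case for a single rolled-back chunk is redoing one full chunk. The class and method names are hypothetical.

```java
class ChunkSizeTradeoff {
    // commits:   number of chunk transactions for the whole run
    // maxRework: items re-read and reprocessed in the worst case when one chunk rolls back
    record Estimate(long commits, int maxRework) {}

    static Estimate estimate(long totalItems, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        long commits = (totalItems + chunkSize - 1) / chunkSize; // ceiling division
        int maxRework = (int) Math.min(chunkSize, totalItems);
        return new Estimate(commits, maxRework);
    }
}
```

For a hypothetical 1,000,000-item run, moving from 1,000 to 5,000 items per chunk cuts commits from 1,000 to 200 but raises the worst-case rework from 1,000 to 5,000 items.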

Transaction boundaries

By default, a chunk-oriented Step opens one transaction per chunk.
If the Reader is cursor-based (e.g. JdbcCursorItemReader), the cursor stays open for the whole step, so be careful not to hold one connection and transaction open for a long time.
Start with READ COMMITTED as the isolation level, and consider REPEATABLE READ for financial calculations that must read the same data repeatedly and get consistent results.
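The isolation level is set per Step. A configuration sketch, assuming Spring Batch 5.x on the classpath: the step builder's transactionAttribute(...) accepts a Spring TransactionAttribute, which is applied to each chunk transaction. The step name, chunk size, and String item type are placeholders, not values from this series.

```java
import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.transaction.PlatformTransactionManager;
import org.springframework.transaction.TransactionDefinition;
import org.springframework.transaction.interceptor.DefaultTransactionAttribute;

class RepeatableReadStepSketch {
    // Hypothetical step whose chunk transactions run at REPEATABLE READ
    // instead of the database default (READ COMMITTED on PostgreSQL).
    static Step settlementStep(JobRepository jobRepository,
                               PlatformTransactionManager transactionManager,
                               ItemReader<String> reader,
                               ItemWriter<String> writer) {
        DefaultTransactionAttribute chunkTx = new DefaultTransactionAttribute();
        chunkTx.setIsolationLevel(TransactionDefinition.ISOLATION_REPEATABLE_READ);
        return new StepBuilder("settlementStep", jobRepository)
            .<String, String>chunk(500, transactionManager)
            .reader(reader)
            .writer(writer)
            .transactionAttribute(chunkTx) // used for every chunk transaction of this step
            .build();
    }
}
```

On PostgreSQL, a REPEATABLE READ transaction can abort with a serialization failure (SQLSTATE 40001) when a row it updates was changed concurrently, so a step like this usually needs a matching retry rule.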

Restart strategy

Store the last processed key (e.g. lastId) in the step ExecutionContext.
Implement the Writer as an idempotent upsert (INSERT ... ON CONFLICT DO UPDATE).
Fix the skip/retry policy so that its behavior is reproducible on restart.
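The three rules above can be exercised together in plain Java. In this minimal sketch a HashMap stands in for the step ExecutionContext and another for the billing_result table, so Map.put on an existing key plays the role of ON CONFLICT DO UPDATE. The crash is injected after a chunk's rows are written but before its pointer is saved, which models writes that escape the chunk transaction (for example, a separate data source). All class, method, and key names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.LongStream;

class RestartSketch {
    // Key for the restart pointer in the stand-in ExecutionContext (hypothetical name).
    static final String LAST_ID_KEY = "billing.lastId";

    static final class SimulatedCrash extends RuntimeException {
        SimulatedCrash() { super("simulated crash"); }
    }

    // Runs one step execution and returns how many items it wrote.
    // context       : stand-in for the step ExecutionContext, kept across executions
    // billingResult : stand-in for billing_result; put() on an existing key acts as an upsert
    // crashAtChunk  : 1-based chunk whose pointer is never saved (-1 means no crash)
    static int execute(List<Long> sortedTargetIds, int chunkSize, Map<String, Long> context,
                       Map<Long, Long> billingResult, int crashAtChunk) {
        long lastId = context.getOrDefault(LAST_ID_KEY, 0L);
        int written = 0;
        int chunkNo = 0;
        while (true) {
            final long after = lastId;
            // Keyset read: next chunk of ids strictly greater than the saved pointer.
            List<Long> chunk = sortedTargetIds.stream()
                .filter(id -> id > after)
                .limit(chunkSize)
                .toList();
            if (chunk.isEmpty()) {
                return written;
            }
            chunkNo++;
            for (long id : chunk) {
                billingResult.put(id, id * 100); // idempotent write keyed by target id
                written++;
            }
            if (chunkNo == crashAtChunk) {
                throw new SimulatedCrash(); // rows written, pointer not yet saved
            }
            lastId = chunk.get(chunk.size() - 1);
            context.put(LAST_ID_KEY, lastId); // pointer saved as part of the chunk "commit"
        }
    }
}
```

With ids 1 to 10, a chunk size of 2, and a crash in the third chunk, the saved pointer is 4. The restarted execution rewrites items 5 and 6, yet the result map still holds exactly one entry per target; an insert-only writer replaying the same chunk would create duplicates or hit unique-key violations instead.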

[Figure: Processing structure diagram]

3) Code examples

Example A: Chunk Step Configuration

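import org.springframework.batch.core.Step;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.step.builder.StepBuilder;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.dao.DeadlockLoserDataAccessException;
import org.springframework.transaction.PlatformTransactionManager;

// Declared inside a @Configuration class; BillingTarget, BillingResult and
// InvalidBillingDataException are domain types not shown here.
@Bean
public Step billingStep(JobRepository jobRepository,
                        PlatformTransactionManager transactionManager,
                        ItemReader<BillingTarget> reader,
                        ItemProcessor<BillingTarget, BillingResult> processor,
                        ItemWriter<BillingResult> writer) {
    return new StepBuilder("billingStep", jobRepository)
        // One transaction per 1,000 items: read, process, write, then commit
        .<BillingTarget, BillingResult>chunk(1000, transactionManager)
        .reader(reader)
        .processor(processor)
        .writer(writer)
        .faultTolerant()
        // Invalid data: skip the item; the step fails once more than 100 items are skipped
        .skip(InvalidBillingDataException.class)
        .skipLimit(100)
        // Transient deadlocks while processing/writing: up to 3 attempts
        .retry(DeadlockLoserDataAccessException.class)
        .retryLimit(3)
        .build();
}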

Example B: Keyset Reader SQL

-- :last_id starts below the smallest id (e.g. 0) and is the last committed id on restart
SELECT id, account_id, amount, due_date
FROM billing_target
WHERE status = 'PENDING'
  AND id > :last_id
ORDER BY id ASC
LIMIT :chunk_size;
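The reader pages with id > :last_id rather than OFFSET. The difference shows up when processed rows drop out of the status = 'PENDING' filter, for example if the job also marks them as done (an assumption for illustration; the SQL above does not update status). A minimal in-memory sketch with hypothetical names: each row processed on a page is marked DONE, just as such a job would do.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class KeysetVsOffsetSketch {
    // In-memory stand-in for a billing_target row.
    static final class Row {
        final long id;
        String status = "PENDING";
        Row(long id) { this.id = id; }
    }

    static List<Row> table(int n) {
        List<Row> rows = new ArrayList<>();
        for (long id = 1; id <= n; id++) {
            rows.add(new Row(id));
        }
        return rows;
    }

    // WHERE status = 'PENDING' AND id > :last_id ORDER BY id LIMIT :chunk_size
    static List<Row> keysetPage(List<Row> rows, long lastId, int chunkSize) {
        return rows.stream()
            .filter(r -> r.status.equals("PENDING") && r.id > lastId)
            .sorted(Comparator.comparingLong((Row r) -> r.id))
            .limit(chunkSize)
            .toList();
    }

    // WHERE status = 'PENDING' ORDER BY id LIMIT :chunk_size OFFSET :offset
    static List<Row> offsetPage(List<Row> rows, int offset, int chunkSize) {
        return rows.stream()
            .filter(r -> r.status.equals("PENDING"))
            .sorted(Comparator.comparingLong((Row r) -> r.id))
            .skip(offset)
            .limit(chunkSize)
            .toList();
    }

    // Processes all pages with keyset paging; returns how many rows were processed.
    static int processWithKeyset(List<Row> rows, int chunkSize) {
        int processed = 0;
        long lastId = 0;
        while (true) {
            List<Row> page = keysetPage(rows, lastId, chunkSize);
            if (page.isEmpty()) return processed;
            for (Row r : page) { r.status = "DONE"; processed++; }
            lastId = page.get(page.size() - 1).id;
        }
    }

    // Same loop with OFFSET paging: the PENDING set shrinks, so the offset overshoots.
    static int processWithOffset(List<Row> rows, int chunkSize) {
        int processed = 0;
        int offset = 0;
        while (true) {
            List<Row> page = offsetPage(rows, offset, chunkSize);
            if (page.isEmpty()) return processed;
            for (Row r : page) { r.status = "DONE"; processed++; }
            offset += chunkSize;
        }
    }
}
```

With 10 rows and a page size of 3, the keyset loop processes all 10 rows, while the OFFSET loop stops after 6 and silently leaves rows 4, 5, 6, and 10 unprocessed. The keyset position is also a single value, which is what makes it a natural restart pointer.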

Example C: Idempotent Writer (PostgreSQL)

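-- Relies on the unique index on target_id (Example D); replaying a chunk updates rows instead of duplicating them
INSERT INTO billing_result (target_id, billed_amount, billed_at)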
VALUES (:target_id, :billed_amount, NOW())
ON CONFLICT (target_id)
DO UPDATE SET
    billed_amount = EXCLUDED.billed_amount,
    billed_at = EXCLUDED.billed_at;

Example D: Index design

-- Serves WHERE status = ... AND id > ... ORDER BY id from Example B
CREATE INDEX idx_billing_target_status_id ON billing_target (status, id);
CREATE UNIQUE INDEX uk_billing_result_target_id ON billing_result (target_id);

4) Real-world failure/operational scenarios

Situation: I raised the chunk size to 5,000 to increase throughput, but after 2 hours long GC pauses and an OutOfMemoryError occurred. The chunk in flight when the job failed was rolled back, all 5,000 items were reprocessed, and the external payment API was called repeatedly for the same items.

Cause:

The Processor kept accumulating external API responses in one object, so memory usage per chunk grew excessively.
The Writer was insert-only rather than an idempotent upsert.
The retry policy included external API exceptions, which multiplied the duplicate calls.

Improvements:

Reduced the chunk size to 1,000 and changed the Processor so that per-item objects are released immediately.
Moved external API calls out of the step with the outbox pattern, so the batch job is only responsible for creating the requests.
Replaced the Writer with an idempotent upsert.
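The outbox change can be illustrated with counters. In this minimal sketch (all names hypothetical, and the "API call" is just a counter), reprocessing the same chunk twice calls the API twice per item when the step calls it directly. With an outbox, the step only records one request per target id (putIfAbsent mirrors INSERT ... ON CONFLICT (target_id) DO NOTHING on an assumed outbox table), and a separate dispatcher sends each pending request once.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class OutboxSketch {
    int paymentApiCalls = 0; // calls to the hypothetical external payment API

    // Before: the chunk calls the API itself, so every reprocessing calls it again.
    void processChunkCallingApi(List<Long> targetIds) {
        for (long id : targetIds) {
            paymentApiCalls++; // stands in for the real HTTP call for this target id
        }
    }

    // After: the chunk only records one request per target id.
    final Map<Long, Boolean> outbox = new LinkedHashMap<>(); // target id -> already sent?

    void processChunkWithOutbox(List<Long> targetIds) {
        for (long id : targetIds) {
            outbox.putIfAbsent(id, Boolean.FALSE); // duplicate requests are ignored
        }
    }

    // Separate dispatcher: sends each pending request once and marks it as sent.
    void dispatchPending() {
        for (Map.Entry<Long, Boolean> entry : outbox.entrySet()) {
            if (!entry.getValue()) {
                paymentApiCalls++;
                entry.setValue(Boolean.TRUE);
            }
        }
    }
}
```

Processing a 3-item chunk twice yields 6 direct API calls but only 3 through the outbox. If the dispatcher crashes after calling the API and before marking the row as sent, the request goes out again, so the outbox gives at-least-once delivery; pairing it with an idempotency key on the payment API (an assumption about that API) closes the gap.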

5) Design Checklist

Did you derive the chunk size from estimated memory usage (object size per item x chunk size x threads)?
Can the Reader restart from a keyset or range position?
Is the Writer idempotent?
Do the skip/retry policies match the domain rules?
Was the transaction isolation level chosen deliberately?
Is the restart pointer (last processed key) stored in the ExecutionContext?
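The memory estimate in the first checklist item can be computed directly. A minimal sketch, with hypothetical names and a hypothetical 20,000 bytes per item: it multiplies object size per item, chunk size, and concurrent threads, and inverts the formula to get the largest chunk size that fits a memory budget.

```java
class ChunkMemoryEstimate {
    // Rough live item memory: bytes per item x chunk size x concurrent threads.
    static long bytesInFlight(long bytesPerItem, int chunkSize, int threads) {
        return Math.multiplyExact(Math.multiplyExact(bytesPerItem, chunkSize), threads);
    }

    // Largest chunk size whose estimate stays within the given budget.
    static int maxChunkSize(long budgetBytes, long bytesPerItem, int threads) {
        long bytesPerChunkSlot = Math.multiplyExact(bytesPerItem, threads);
        return (int) Math.min(Integer.MAX_VALUE, budgetBytes / bytesPerChunkSlot);
    }
}
```

At 20,000 bytes per item and 4 threads, a chunk of 5,000 holds about 400 MB of items at once versus about 80 MB for a chunk of 1,000, and a 100 MB budget allows at most 1,250 items per chunk. The formula covers only items inside the current chunk; anything accumulated across chunks, like the Processor object in the failure scenario, grows beyond it.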

6) Summary

The real value of Spring Batch is not “mass processing” but “controllable failure.” To hold up in production, you need explicit chunk boundaries, an idempotent Writer, and a restart pointer.

7) Next part preview

The next part covers Spring Batch's parallel processing strategies: the criteria for choosing among partitioning, multi-threaded steps, and remote chunking, and their practical throughput/consistency trade-offs.