Part 11. Failure response architecture: Partial Failure, Poison Data, DLQ, Retry, Idempotence

Versions used

Java 21
Spring Boot 3.3.x
Spring Batch 5.2.x
Quartz 2.3.x
PostgreSQL 15
OpenSearch 2.x

1) Problem statement

In production, failures are the norm, not the exception. Network timeouts, external API rate limits, and data quality errors will inevitably occur. Rather than trying to “eliminate” failures, we need to design the system to “isolate and recover” from them.

Most incidents escalate through one of the patterns below.

The same retry policy is applied to every exception.
A single poison record stops the entire job.
There is no defense against duplicate execution, so the retry itself causes side effects.

2) Summary of key concepts

Failure classification

Failure type	Example	Handling
Transient failure	timeout, 429, deadlock	Bounded retries + backoff
Permanent failure	Schema mismatch, missing required value	Isolate in the DLQ, then process manually
Partial failure	120 failures out of 10,000	Commit the successes + accumulate the failures separately
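The classification in the table can be sketched in code. The following is a minimal illustration using only JDK exception types; the names FailureType, FailureClassifier, and BatchOutcome are hypothetical, and a real classifier would also map the Spring and HTTP exceptions that the job actually encounters.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.TimeoutException;

/** Hypothetical categories mirroring the per-record rows of the table. */
enum FailureType { TRANSIENT, PERMANENT }

/** Hypothetical batch-level summary: partial failure is a property of the run, not of one exception. */
record BatchOutcome(int total, int failed) {
    boolean isPartialFailure() {
        return failed > 0 && failed < total;
    }
}

final class FailureClassifier {
    private FailureClassifier() {}

    /**
     * Walks the cause chain (bounded depth) so that a wrapped timeout is still
     * treated as transient. Anything unrecognized is treated as permanent,
     * which means unknown errors are not retried by default.
     */
    static FailureType classify(Throwable t) {
        Throwable cur = t;
        for (int depth = 0; cur != null && depth < 16; depth++) {
            if (cur instanceof SocketTimeoutException || cur instanceof TimeoutException) {
                return FailureType.TRANSIENT;
            }
            cur = cur.getCause();
        }
        return FailureType.PERMANENT;
    }
}
```

Defaulting unknown exceptions to PERMANENT is one possible choice: it favors fast isolation over optimistic retrying, at the cost of sending some recoverable errors to manual handling.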

Poison data handling

Poison data is data that keeps failing no matter how many times it is re-run with the same input. The key is to isolate it quickly.

Move the record to the DLQ as soon as the maximum number of retries is exceeded.
Save the original payload, the error stack trace, the processing step, and the running application version together.
Keep the DLQ reprocessing path separate, behind its own approval flow.
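The isolation rule above can be sketched as follows. This is an illustration under assumptions, not the series' implementation: DlqEntry and PoisonIsolator are hypothetical names, the DLQ is an in-memory list standing in for a table, and backoff between attempts is deliberately left out to keep the focus on isolation.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical DLQ record: what is needed to reproduce the failure later. */
record DlqEntry(long sourceId, String payload, String errorType, String errorStack,
                String step, String appVersion, int attempts, Instant failedAt) {}

final class PoisonIsolator {
    private final int maxAttempts;
    private final String appVersion;
    private final List<DlqEntry> dlq = new ArrayList<>(); // stand-in for a DLQ table

    PoisonIsolator(int maxAttempts, String appVersion) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.appVersion = appVersion;
    }

    List<DlqEntry> dlq() {
        return dlq;
    }

    /**
     * Runs the handler up to maxAttempts times. Returns true on success.
     * If every attempt fails, the record is moved to the DLQ with its
     * payload, stack trace, step name, and version, and false is returned
     * so the caller can continue with the next record.
     */
    boolean process(long sourceId, String payload, String step, Consumer<String> handler) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(payload);
                return true;
            } catch (RuntimeException e) {
                last = e;
            }
        }
        dlq.add(new DlqEntry(sourceId, payload, last.getClass().getName(), stackOf(last),
                step, appVersion, maxAttempts, Instant.now()));
        return false;
    }

    private static String stackOf(Throwable t) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        t.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }
}
```

Because process returns false instead of rethrowing, one poison record does not stop the rest of the job; the failure is accumulated separately and the run continues.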

Failure state transition diagram

(Diagram not rendered in this copy.)

3) Code examples

Example A: Retry policy by exception type

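import java.net.SocketTimeoutException;
import java.time.Duration;
import org.springframework.dao.PessimisticLockingFailureException;
import org.springframework.web.client.HttpClientErrorException;
import org.springframework.web.client.HttpServerErrorException;

// Retry only transient failures; everything else should go to the DLQ.
public boolean shouldRetry(Throwable t) {
    return t instanceof SocketTimeoutException
        || t instanceof PessimisticLockingFailureException       // deadlock / lock failure (parent of the deprecated DeadlockLoserDataAccessException)
        || t instanceof HttpClientErrorException.TooManyRequests // HTTP 429
        || t instanceof HttpServerErrorException;                // HTTP 5xx
}

// Exponential backoff: 2^attempt seconds, capped at 60 seconds.
public Duration backoff(int attempt) {
    long seconds = Math.min((long) Math.pow(2, attempt), 60);
    return Duration.ofSeconds(seconds);
}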
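To show how a retry predicate and a backoff function like the ones in Example A fit together, here is a minimal retry loop using only the JDK. RetryExecutor is a hypothetical name; the retry predicate and the sleeper are injected so the loop can be exercised without Spring and without real waiting. The backoff method repeats the formula from Example A.

```java
import java.time.Duration;
import java.util.function.Consumer;
import java.util.function.Predicate;
import java.util.function.Supplier;

final class RetryExecutor {
    private final int maxAttempts;
    private final Predicate<Throwable> retryable;
    private final Consumer<Duration> sleeper;

    RetryExecutor(int maxAttempts, Predicate<Throwable> retryable, Consumer<Duration> sleeper) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
        this.retryable = retryable;
        this.sleeper = sleeper;
    }

    /** Same formula as Example A: 2^attempt seconds, capped at 60 seconds. */
    static Duration backoff(int attempt) {
        long seconds = Math.min((long) Math.pow(2, attempt), 60);
        return Duration.ofSeconds(seconds);
    }

    /**
     * Calls the action until it succeeds, the exception is not retryable,
     * or maxAttempts is reached. The last exception is rethrown so the
     * caller can route the record to the DLQ.
     */
    <T> T call(Supplier<T> action) {
        for (int attempt = 1; ; attempt++) {
            try {
                return action.get();
            } catch (RuntimeException e) {
                if (attempt >= maxAttempts || !retryable.test(e)) {
                    throw e;
                }
                sleeper.accept(backoff(attempt)); // wait before the next attempt
            }
        }
    }
}
```

With maxAttempts set to 3, an action that fails twice with a retryable exception waits 2 seconds and then 4 seconds before succeeding on the third call, while a non-retryable exception is rethrown after the first call with no waiting.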

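Example B: Idempotent writer (no duplicate rows)

-- Requires a UNIQUE constraint (or primary key) on request_id.
-- A retried write with the same request_id becomes a no-op instead of a second row.
INSERT INTO payout_result (request_id, account_id, amount, status, processed_at)
VALUES (:request_id, :account_id, :amount, :status, NOW())
ON CONFLICT (request_id)
DO NOTHING;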
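The effect of ON CONFLICT (request_id) DO NOTHING can be modeled in plain Java to make the retry behavior concrete. This is only an in-memory model, not database code: IdempotentPayoutWriter and PayoutResult are hypothetical names, and the map stands in for the payout_result table keyed by request_id.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical row type mirroring the columns written in Example B. */
record PayoutResult(String requestId, long accountId, BigDecimal amount, String status) {}

final class IdempotentPayoutWriter {
    // Stand-in for the table; the map key plays the role of the unique index on request_id.
    private final Map<String, PayoutResult> table = new ConcurrentHashMap<>();

    /**
     * Returns true if the row was inserted, false if a row with the same
     * request_id already existed (the DO NOTHING branch). The existing row
     * is left untouched in that case.
     */
    boolean write(PayoutResult row) {
        return table.putIfAbsent(row.requestId(), row) == null;
    }

    int rowCount() {
        return table.size();
    }
}
```

The boolean corresponds to the affected-row count of the INSERT (1 when inserted, 0 on conflict), which lets the caller tell a first execution apart from a retried one.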

Example C: DLQ table design

CREATE TABLE batch_dlq (
    id BIGSERIAL PRIMARY KEY,
    job_name VARCHAR(100) NOT NULL,
    source_id BIGINT NOT NULL,
    payload JSONB NOT NULL,
    error_type VARCHAR(100) NOT NULL,
    error_message TEXT NOT NULL,
    retry_count INT NOT NULL,
    failed_at TIMESTAMP NOT NULL DEFAULT NOW(),
    reprocess_status VARCHAR(20) NOT NULL DEFAULT 'PENDING'
);

CREATE INDEX idx_batch_dlq_job_failed ON batch_dlq (job_name, failed_at DESC);
CREATE INDEX idx_batch_dlq_reprocess_id ON batch_dlq (reprocess_status, id);

Example D: DLQ Reprocessing Keyset Query

SELECT id, source_id, payload
FROM batch_dlq
WHERE reprocess_status = 'PENDING'
  AND id > :last_id
ORDER BY id ASC
LIMIT 200;
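The query in Example D is meant to be called in a loop that carries the last seen id forward. The sketch below models that loop in memory: DlqKeysetReader and DlqRow are hypothetical names, a TreeMap stands in for the PENDING rows ordered by id, and the starting value 0 assumes ids are positive, as they are with BIGSERIAL.

```java
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.function.Consumer;

/** Hypothetical projection of the columns selected in Example D. */
record DlqRow(long id, long sourceId, String payload) {}

final class DlqKeysetReader {
    // Stand-in for rows with reprocess_status = 'PENDING', ordered by id.
    private final NavigableMap<Long, DlqRow> pending = new TreeMap<>();

    void add(DlqRow row) {
        pending.put(row.id(), row);
    }

    /** Emulates: WHERE id > :last_id ORDER BY id ASC LIMIT :limit */
    List<DlqRow> fetchPage(long lastId, int limit) {
        return pending.tailMap(lastId, false).values().stream().limit(limit).toList();
    }

    /** Visits every pending row page by page and returns the number of rows visited. */
    int forEachPending(int pageSize, Consumer<DlqRow> handler) {
        long lastId = 0L; // assumes ids start at 1
        int visited = 0;
        while (true) {
            List<DlqRow> page = fetchPage(lastId, pageSize);
            if (page.isEmpty()) {
                return visited;
            }
            for (DlqRow row : page) {
                handler.accept(row);
                visited++;
            }
            lastId = page.get(page.size() - 1).id(); // the key for the next page
        }
    }
}
```

Unlike OFFSET paging, the position is defined by the last id rather than by a row count, so rows that leave the PENDING state during reprocessing do not cause later pages to skip records.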

4) Real-world failure/operational scenarios

Situation: 8,000 out of 40,000 records failed because the external settlement API was down (HTTP 500). The system retried without limit, so even after the external API recovered, the flood of retry requests brought it down again.

Cause:

There was no retry cap and no exponential backoff.
Permanent failures (an invalid account number) were not distinguished from transient ones.
Retrying without an idempotency key caused duplicate withdrawal attempts.

Improvements:

Retry based on exception classification (at most 3 times) with exponential backoff.
Send permanent failures to the DLQ immediately.
Prevent duplicate withdrawals with a unique index on request_id.

5) Design Checklist

Are failures classified as transient, permanent, or partial?
Are the maximum number of retries and the backoff policy defined?
Is there a path to isolate poison data in the DLQ?
Is the information needed to reproduce a failure (input/error/version) stored with each DLQ entry?
Are duplicate side effects blocked with idempotency keys and unique indexes?
Does DLQ reprocessing also have an audit trail and an approval flow?

6) Summary

The goal of failure-response design is not to “eliminate failure” but to “keep failure controllable.” Designing retry, DLQ, and idempotency together preserves the ability to recover even from large-scale failures. What matters most is not reducing the number of failures but enabling operators to recover from them within a predictable time.

7) Next part preview

In the next part (the final one), we will bring together everything covered so far and present a reference architecture and a decision matrix. It will provide the final selection criteria for when to combine @Scheduled, Quartz, Spring Batch, and manual deployment.

Reference links

Previous post: Part 10. Performance optimization: batch size, commit interval, JVM memory, backpressure
Next post: Part 12. Integrated reference architecture and final selection guide