Part 11. Failure response architecture: Partial Failure, Poison Data, DLQ, Retry, Idempotence
We bring together failure classification, DLQ, retry, and idempotency patterns so that batch failures are designed into a recoverable state rather than treated as something to merely avoid.
Series: Spring Boot Batch Strategy, the Complete Guide
12 parts in total. You are reading Part 11.
- Part 1. Nature and classification of batch jobs: schedule, event, manual, bulk, near-real-time
- Part 2. @Scheduled in action: the price of simplicity and multi-instance pitfalls
- Part 3. Quartz cluster architecture: JobStore, Misfire, large-scale schedule management
- Part 4. Spring Batch core: Chunk, transaction boundary, restartable job design
- Part 5. Spring Batch extensions: trade-off between Partition and Multi-threaded Step
- Part 6. Manual batch strategy: REST triggers, Admin UI, parameter reprocessing, rollback
- Part 7. DB bulk query strategy: OFFSET/LIMIT limits and Keyset, ID range, covering index
- Part 8. OpenSearch/Elasticsearch batch strategy: Scroll, Search After, PIT, Bulk, Rollover
- Part 9. Batches in distributed environments: Leader Election, Kubernetes CronJob, lock strategy comparison
- Part 10. Performance optimization: batch size, commit interval, JVM memory, Backpressure
- Part 11. Failure response architecture: Partial Failure, Poison Data, DLQ, Retry, Idempotence (current part)
- Part 12. Integrated reference architecture and final selection guide

Source: Pexels - Dashboard warning light
Versions used
- Java 21
- Spring Boot 3.3.x
- Spring Batch 5.2.x
- Quartz 2.3.x
- PostgreSQL 15
- OpenSearch 2.x
1) Problem statement
In production batch operations, failures are the norm, not the exception. Network timeouts, external API rate limits, and data quality errors inevitably occur. Instead of trying to "eliminate" failures, we need to design so that we can "isolate and recover from" them.
Most incidents grow out of the patterns below.
- The same retry policy is applied to every exception.
- The entire job halts because of a single piece of poison data.
- There is no defense against duplicate execution, so retries themselves create side effects.
2) Summary of key concepts
Failure classification
| Failure type | Example | Handling |
|---|---|---|
| Transient | Timeout, HTTP 429, deadlock | Bounded retries + backoff |
| Permanent | Schema mismatch, missing required field | Isolate to DLQ, then process manually |
| Partial | 120 failures out of 10,000 records | Commit the successes, accumulate failures separately |
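The table above can be sketched as a small classifier. This is a minimal illustration, not code from the original post: `FailureType` and `FailureClassifier` are hypothetical names, and the `instanceof` checks stand in for whatever exception taxonomy your stack actually produces.

```java
import java.net.SocketTimeoutException;
import java.sql.SQLTransientConnectionException;

// Hypothetical sketch: map a Throwable onto the failure types above.
enum FailureType { TRANSIENT, PERMANENT }

final class FailureClassifier {
    static FailureType classify(Throwable t) {
        // Transient: worth a bounded retry with backoff
        if (t instanceof SocketTimeoutException
                || t instanceof SQLTransientConnectionException) {
            return FailureType.TRANSIENT;
        }
        // Everything else (schema mismatch, validation errors, ...) goes to the DLQ
        return FailureType.PERMANENT;
    }
}
```

Partial failure is not a property of a single record, so it is handled at the job level (commit successes, accumulate failures) rather than in this per-record classifier.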
Poison Data Processing
Poison data is data that keeps failing no matter how many times it is reprocessed with the same input. The key is fast isolation.
- Move the record to the DLQ immediately once the maximum retry count is exceeded.
- Store the original payload, error stack, processing timestamp, and execution version together.
- Route DLQ reprocessing through a separate approval flow.
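The "isolate once retries are exhausted" rule might look like the following minimal sketch. `PoisonDataGuard`, `DlqRecord`, and the sink are illustrative names I am assuming here; the sink stands in for an INSERT into a DLQ table like the one shown later.

```java
import java.util.function.Consumer;

// What ends up in the DLQ: the original payload plus error context.
record DlqRecord(String payload, String errorType, String errorMessage, int retryCount) {}

final class PoisonDataGuard {
    private final int maxRetries;
    private final Consumer<DlqRecord> dlqSink; // e.g. an INSERT into batch_dlq

    PoisonDataGuard(int maxRetries, Consumer<DlqRecord> dlqSink) {
        this.maxRetries = maxRetries;
        this.dlqSink = dlqSink;
    }

    /** Returns true if the item was processed, false if it was isolated to the DLQ. */
    boolean process(String payload, Consumer<String> handler) {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                handler.accept(payload);
                return true;
            } catch (RuntimeException e) {
                if (attempt == maxRetries) {
                    // Retries exhausted: isolate instead of failing the whole job
                    dlqSink.accept(new DlqRecord(payload,
                            e.getClass().getSimpleName(), e.getMessage(), attempt));
                }
            }
        }
        return false;
    }
}
```

The important property is the return value: the caller's chunk keeps going either way, and only the poisoned record is diverted.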
Failure state transition diagram

Source: Pexels - Road warning sign
3) Code example
Example A: Retry policy by exception type
public boolean shouldRetry(Throwable t) {
    // Retry only transient failures: network timeouts, DB deadlocks,
    // 5xx responses, and HTTP 429 rate limiting
    return t instanceof SocketTimeoutException
            || t instanceof DeadlockLoserDataAccessException
            || t instanceof HttpServerErrorException
            || t instanceof HttpClientErrorException.TooManyRequests;
}

public Duration backoff(int attempt) {
    // Exponential backoff: 2, 4, 8, ... seconds, capped at 60
    long seconds = Math.min((long) Math.pow(2, attempt), 60);
    return Duration.ofSeconds(seconds);
}
Example B: Idempotent Writer (no duplication)
INSERT INTO payout_result (request_id, account_id, amount, status, processed_at)
VALUES (:request_id, :account_id, :amount, :status, NOW())
ON CONFLICT (request_id)
DO NOTHING;
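The contract of ON CONFLICT DO NOTHING, where the first write wins and replays are no-ops, can be illustrated in plain Java. `IdempotentStore` is a hypothetical in-memory stand-in for the table and its unique key, not a replacement for the database constraint.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// putIfAbsent mirrors ON CONFLICT (request_id) DO NOTHING:
// only the first write for a given request_id takes effect.
final class IdempotentStore {
    private final ConcurrentMap<String, Long> byRequestId = new ConcurrentHashMap<>();

    /** Returns true if this was the first write for the request_id. */
    boolean writeOnce(String requestId, long amount) {
        return byRequestId.putIfAbsent(requestId, amount) == null;
    }
}
```

In the SQL version the same signal is available as the update count: 0 affected rows means the request was already processed and the retry was absorbed.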
Example C: DLQ table design
CREATE TABLE batch_dlq (
    id               BIGSERIAL PRIMARY KEY,
    job_name         VARCHAR(100) NOT NULL,
    source_id        BIGINT NOT NULL,
    payload          JSONB NOT NULL,
    error_type       VARCHAR(100) NOT NULL,
    error_message    TEXT NOT NULL,
    retry_count      INT NOT NULL,
    failed_at        TIMESTAMP NOT NULL DEFAULT NOW(),
    reprocess_status VARCHAR(20) NOT NULL DEFAULT 'PENDING'
);
CREATE INDEX idx_batch_dlq_job_failed ON batch_dlq (job_name, failed_at DESC);
CREATE INDEX idx_batch_dlq_reprocess_id ON batch_dlq (reprocess_status, id);
Example D: DLQ Reprocessing Keyset Query
SELECT id, source_id, payload
FROM batch_dlq
WHERE reprocess_status = 'PENDING'
  AND id > :last_id
ORDER BY id ASC
LIMIT 200;
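Driving this query from application code might look like the sketch below. `DlqReprocessor`, `DlqRow`, and `fetchPage` are illustrative names I am assuming; `fetchPage` stands in for executing the SQL above with the given last id.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.LongFunction;

// One row of the keyset query's result set.
record DlqRow(long id, long sourceId, String payload) {}

final class DlqReprocessor {
    /** Pages through PENDING rows by ascending id and returns how many were handled. */
    static long drain(LongFunction<List<DlqRow>> fetchPage, Consumer<DlqRow> handler) {
        long lastId = 0;
        long processed = 0;
        while (true) {
            // WHERE reprocess_status = 'PENDING' AND id > :last_id ... LIMIT 200
            List<DlqRow> page = fetchPage.apply(lastId);
            if (page.isEmpty()) break;
            for (DlqRow row : page) {
                handler.accept(row);
                processed++;
            }
            lastId = page.get(page.size() - 1).id(); // advance the keyset cursor
        }
        return processed;
    }
}
```

Because the cursor is the last seen id rather than an OFFSET, reprocessing stays O(page size) per batch even as the DLQ grows, matching the keyset approach from Part 7.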
4) Real-world failure/operational scenarios
Situation: 8,000 out of 40,000 records failed when an external settlement API returned HTTP 500. The system retried without limit, and even after the external API recovered, it failed again under the resulting flood of retry requests.
Causes:
- There was no retry cap and no exponential backoff.
- Permanent failures (e.g., an invalid account number) were not distinguished from transient ones.
- Retrying without an idempotency key led to duplicate withdrawal attempts.
Improvements:
- Apply classification-based retries (at most 3 attempts) with exponential backoff.
- Isolate permanent failures to the DLQ immediately.
- Prevent duplicate withdrawals with a unique index on request_id.
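One way to feed that unique index is to derive request_id deterministically from the business key, so a retry of the same payout reproduces the same key and collides with the earlier insert instead of withdrawing twice. The key fields below (account id, settlement date) and the `RequestIds` name are assumptions for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Hypothetical sketch: a stable, replay-safe request_id for the payout upsert.
final class RequestIds {
    static String forPayout(long accountId, String settlementDate) {
        // Same business key in, same UUID out: a retry cannot mint a fresh key
        String businessKey = "payout:" + accountId + ":" + settlementDate;
        return UUID.nameUUIDFromBytes(businessKey.getBytes(StandardCharsets.UTF_8)).toString();
    }
}
```

The alternative of generating a random UUID per attempt defeats the unique index, because every retry would look like a new request.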
5) Design Checklist
- Are failures classified as transient, permanent, or partial?
- Are a maximum retry count and a backoff policy defined?
- Is there a path to isolate poison data to DLQ?
- Is reproducible information (input/error/version) stored in the DLQ payload?
- Do you block duplicate side effects with idempotent keys and unique indexes?
- Does DLQ reprocessing also have an audit trail and approval flow?
6) Summary
The goal of failure-response design is not to "eliminate failure" but to "keep failure controllable." By designing retry, DLQ, and idempotency together, you preserve the ability to recover even from large-scale failures. What matters is not driving the failure count down, but ensuring that operators can recover from failures within a predictable time.
7) Next part preview
The next part, the final one, integrates everything covered so far into a reference architecture and decision matrix, with final selection criteria for when and how to combine @Scheduled, Quartz, Spring Batch, and manual batch execution.
Reference links
- Spring Batch Reference
- Quartz Scheduler Documentation
- PostgreSQL Transaction Isolation
- Blog: Designing an Idempotency Key API
Series navigation
- Previous post: Part 10. Performance optimization: batch size, commit interval, JVM memory, Backpressure
- Next post: Part 12. Integrated reference architecture and final selection guide