Why Your Event-Driven System Needs a Recovery Contract, Not Just a Retry

A Kafka replay that processes 1,842 events, skips 17 duplicates, and emits a record with confidence_status: "trusted" marks the difference between operational hope and architectural confidence. That evidence record, not the rebuilt state alone, is what separates systems that merely resume from systems that actually recover.

Where Reliability Ends and Replayability Begins

Reliability asks whether the system keeps working. Replayability asks whether the system can safely revisit history and prove the result is correct. In high-throughput event-driven pipelines—inventory updates, usage-based billing, security analytics, payments—the failure often appears later as disagreement between derived states. Inventory says 388 units, the selling engine says 380, the warehouse says 379, and nobody can explain why. The hard question is not "did Kafka lose the message?" It's "which state is correct, and how do we prove it?"

The Seven Questions of a Recovery Contract

The article introduces a design artifact called a Recovery Contract. It answers exactly seven questions for any critical event flow:

H: Authoritative history
O: Ordering boundary
I: Idempotency key
F: Deterministic projection function
S: Replay scope
Q: Reconciliation query or invariant
E: Recovery evidence
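
One way to keep the seven answers honest is to hold them in a single typed value that refuses to exist with a blank answer. The sketch below is an assumption about how such an artifact could be modeled, not the article's own code: the record name, component names, and the sample topic name, projection name, and replay scope are all hypothetical.

```java
import java.util.List;

// Hypothetical model of a Recovery Contract. Each component maps to one of
// the seven letters; none of these names come from the article's code.
record RecoveryContract(
        String authoritativeHistory,        // H: where the trusted event log lives
        String orderingBoundary,            // O: key within which order must hold
        String idempotencyKey,              // I: field that identifies a duplicate
        String projectionFunction,          // F: name of the deterministic projection
        String replayScope,                 // S: how much history a replay may touch
        List<String> reconciliationChecks,  // Q: invariants verified after replay
        List<String> evidenceFields) {      // E: fields emitted as recovery evidence

    // Compact constructor: reject a contract that leaves any question open.
    RecoveryContract {
        requireAnswer("H", authoritativeHistory);
        requireAnswer("O", orderingBoundary);
        requireAnswer("I", idempotencyKey);
        requireAnswer("F", projectionFunction);
        requireAnswer("S", replayScope);
        reconciliationChecks = requireAnswers("Q", reconciliationChecks);
        evidenceFields = requireAnswers("E", evidenceFields);
    }

    private static void requireAnswer(String question, String answer) {
        if (answer == null || answer.isBlank()) {
            throw new IllegalArgumentException("Recovery Contract question " + question + " is unanswered");
        }
    }

    private static List<String> requireAnswers(String question, List<String> answers) {
        if (answers == null || answers.isEmpty()) {
            throw new IllegalArgumentException("Recovery Contract question " + question + " is unanswered");
        }
        return List.copyOf(answers); // defensive, immutable copy
    }
}

final class RecoveryContractDemo {
    public static void main(String[] args) {
        RecoveryContract inventory = new RecoveryContract(
                "inventory-transactions",            // hypothetical topic name
                "sku",
                "event_id",
                "applyInventoryTransaction",         // hypothetical projection name
                "single SKU, bounded offset range",  // hypothetical scope
                List.of("stock_on_hand_matches_transactions", "sellable_quantity_non_negative"),
                List.of("events_processed", "duplicates_skipped", "projections_changed",
                        "reconciliation_failures", "confidence_status"));
        System.out.println(inventory);
    }
}
```

Validating at construction time is a design choice for the sketch: a contract with an unanswered question fails when it is written down, long before an incident depends on it.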

For a real-time inventory pipeline using Kafka, PostgreSQL/Aurora, Debezium CDC, and Kafka Streams, the contract might define sku as the ordering key, event_id as the idempotency key, and checks like stock_on_hand_matches_transactions and sellable_quantity_non_negative. The evidence emitted after replay includes events_processed, duplicates_skipped, projections_changed, reconciliation_failures, and confidence_status. Dead-letter queues don't replace this—they only tell you where some failures landed, not whether derived state was already partially corrupted.
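
The replay described above can be sketched in plain Java without any Kafka dependency. This is a minimal illustration under stated assumptions, not the article's implementation: the type names are hypothetical, events_processed is taken to mean events actually applied (duplicates excluded), only the sellable_quantity_non_negative invariant is checked, and "untrusted" is an invented counterpart to the "trusted" status the article shows.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical event and evidence shapes; field names follow the article's
// evidence fields, the types themselves are assumptions.
record StockEvent(String eventId, String sku, int quantityDelta) {}

record RecoveryEvidence(int eventsProcessed, int duplicatesSkipped, int projectionsChanged,
                        int reconciliationFailures, String confidenceStatus) {}

record ReplayResult(Map<String, Integer> rebuiltStock, RecoveryEvidence evidence) {}

final class InventoryReplay {
    private InventoryReplay() {}

    // Rebuilds stock per SKU from history and reports what the replay did.
    static ReplayResult replay(List<StockEvent> history, Map<String, Integer> currentStock) {
        Set<String> seenEventIds = new HashSet<>();
        Map<String, Integer> rebuilt = new HashMap<>();
        int processed = 0;
        int duplicates = 0;

        for (StockEvent event : history) {
            // Idempotency: an event_id seen before is skipped, never re-applied.
            if (!seenEventIds.add(event.eventId())) {
                duplicates++;
                continue;
            }
            rebuilt.merge(event.sku(), event.quantityDelta(), Integer::sum);
            processed++;
        }

        Set<String> skus = new HashSet<>(rebuilt.keySet());
        skus.addAll(currentStock.keySet());
        int changed = 0;
        int failures = 0;
        for (String sku : skus) {
            int rebuiltQty = rebuilt.getOrDefault(sku, 0);
            int currentQty = currentStock.getOrDefault(sku, 0);
            if (rebuiltQty != currentQty) {
                changed++;   // the replay disagrees with the stored projection
            }
            if (rebuiltQty < 0) {
                failures++;  // violates sellable_quantity_non_negative
            }
        }

        // "untrusted" is an assumed label; the article only shows "trusted".
        String status = failures == 0 ? "trusted" : "untrusted";
        return new ReplayResult(Map.copyOf(rebuilt),
                new RecoveryEvidence(processed, duplicates, changed, failures, status));
    }
}

final class InventoryReplayDemo {
    public static void main(String[] args) {
        List<StockEvent> history = List.of(
                new StockEvent("e1", "SKU-A", 10),
                new StockEvent("e2", "SKU-A", -3),
                new StockEvent("e2", "SKU-A", -3),  // redelivered duplicate
                new StockEvent("e3", "SKU-B", 5));
        // Stored projection that already applied the duplicate once too often.
        Map<String, Integer> currentStock = Map.of("SKU-A", 4, "SKU-B", 5);
        System.out.println(InventoryReplay.replay(history, currentStock).evidence());
    }
}
```

For the sample history, the sketch reports 3 events processed, 1 duplicate skipped, 1 projection changed, no reconciliation failures, and a "trusted" status. The changed projection is the interesting part: it records that the stored state was wrong before the replay, which a dead-letter queue alone would not reveal.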

Why Partitioning Is a Correctness Boundary

Partitioning in Kafka is usually treated as a scaling mechanism. For stateful event processing, it's also a correctness boundary. If all updates for the same SKU must be processed in order, the SKU belongs in the partition key. Scatter events for the same entity across partitions, and replay becomes harder to reason about. The article provides a concrete SkuPartitioner that uses Math.floorMod(orderingKey.hashCode(), partitionCount). The implementation is simple, but the decision behind it is an architectural one, not an infrastructure detail.
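
The partitioning expression can be tried in isolation. The sketch below assumes the ordering key is a String and wraps the expression in a hypothetical static helper; it deliberately does not implement Kafka's producer Partitioner interface, so it runs without the Kafka client on the classpath. Only the Math.floorMod expression itself comes from the article.

```java
// Hypothetical standalone helper around the article's partitioning expression.
final class SkuPartitioner {
    private SkuPartitioner() {}

    // Maps an ordering key (here, a SKU) to a partition in [0, partitionCount).
    static int partitionFor(String orderingKey, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        // floorMod, unlike %, never returns a negative value for a positive
        // divisor: -7 % 12 is -7 in Java, while Math.floorMod(-7, 12) is 5.
        // String.hashCode() can be negative, so this matters.
        return Math.floorMod(orderingKey.hashCode(), partitionCount);
    }
}

final class SkuPartitionerDemo {
    public static void main(String[] args) {
        int partitions = 12; // assumed partition count for the example
        int first = SkuPartitioner.partitionFor("SKU-1042", partitions);
        int second = SkuPartitioner.partitionFor("SKU-1042", partitions);
        // The same SKU always lands on the same partition, so its events
        // stay in one ordered log.
        System.out.println(first == second); // prints "true"
    }
}
```

The mapping depends on partitionCount, so changing the number of partitions sends existing keys to different partitions. That is one reason the partition count belongs in the same conversation as the ordering boundary.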

Schema governance is another replay concern. If a consumer can't read a six-month-old event because a field changed meaning, replay produces a technically valid but semantically wrong projection. A Recovery Contract should name the schema compatibility expectations, ensuring changes don't silently destroy the ability to recover.
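
One possible way to make a schema expectation checkable is a pre-replay guard that scans history for schema versions the projection does not claim to understand. The sketch below is an assumption, not something the article specifies: it presumes each event carries an integer schema version, and every name in it is hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical event envelope that carries an explicit schema version.
record VersionedEvent(String eventId, int schemaVersion) {}

final class SchemaReplayGuard {
    private SchemaReplayGuard() {}

    // Counts events per schema version that the projection cannot interpret.
    // An empty result means the replay scope is readable under the contract.
    static Map<Integer, Integer> unreadableVersions(List<VersionedEvent> history,
                                                    Set<Integer> readableVersions) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (VersionedEvent event : history) {
            if (!readableVersions.contains(event.schemaVersion())) {
                counts.merge(event.schemaVersion(), 1, Integer::sum);
            }
        }
        return counts;
    }
}

final class SchemaReplayGuardDemo {
    public static void main(String[] args) {
        List<VersionedEvent> history = List.of(
                new VersionedEvent("e1", 1),  // older event, version no longer readable
                new VersionedEvent("e2", 2),
                new VersionedEvent("e3", 3));
        Map<Integer, Integer> blocked =
                SchemaReplayGuard.unreadableVersions(history, Set.of(2, 3));
        if (!blocked.isEmpty()) {
            System.out.println("Replay refused, unreadable schema versions: " + blocked);
        }
    }
}
```

A guard like this only catches version gaps. A field that changes meaning without a version bump passes straight through, which is why the contract needs to state the compatibility expectation itself rather than rely on a check.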

The strongest systems are not the ones that never fail—they are the ones that can restore justified confidence quickly. Writing down those seven answers before an incident turns replay from a dangerous manual operation into a controlled, verifiable workflow.

Source: Building Event-Driven Systems That Can Recover With Confidence
Domain: hackernoon.com
