The Saga Pattern: Managing Distributed Transactions Without Losing Your Mind

Imagine it’s Friday. 4:15 PM. You’ve just deployed the new “Order-to-Delivery” microservice flow. Service A calls Service B, B calls C. It’s elegant. It’s “decoupled.” You go home and feel like a cloud-native architect.

Then, at 2 AM, the pager goes off.

A customer bought a $2,000 espresso machine. Service A (Orders) created the record. Service B (Payments) successfully charged the card. But Service C (Inventory) was having a moment because someone ran a backup during peak hours. The inventory update failed.

The result? The customer has no espresso machine, you have their $2,000, and your Order service thinks everything is fine. You spend the next four hours manually running SQL updates across three production databases, squinting at logs, trying to figure out what to refund and what to delete.

This is the distributed transaction nightmare. We killed the monolith and in our excitement buried ACID transactions in the same grave. We traded one manageable fire for a thousand invisible ones. Welcome to the Saga Pattern — the industry’s way of admitting that distributed systems are chaotic and “eventual consistency” is just a polite way of saying “it’ll be correct… eventually, hopefully.”


The Distributed Transaction Lies We Tell Ourselves

  1. “The network is reliable.” It’s not. It’s tubes held together by hope.
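  2. “I’ll just use 2PC (Two-Phase Commit).” Your relational database may speak XA, but the message broker, the document store, and the third-party payment API in the same flow probably don’t. Even where it works, every participant holds its locks until the coordinator decides, and the latency would turn your API into a carrier pigeon service.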
  3. “I’ll just add a @Transactional annotation.” That works great — on one database. Your data is scattered across three AWS regions and a legacy server in someone’s basement.
  4. “I’ll fix the inconsistencies with a cron job on Monday.” Monday comes, the cron fails, now you have two problems.

Phase 1: The Optimistic Approach (and Why It Fails)

The developer who believes in the stability of the internet writes code that assumes everything works on the first try.

@Service
public class OrderService {
    // Repository and client fields omitted for brevity.
    public void createOrder(OrderRequest request) {
        Order order = orderRepository.save(new Order(request));        // committed locally
        paymentClient.charge(order.getId(), request.getAmount());      // card is charged
        inventoryClient.reserve(order.getSku(), order.getQuantity());  // if this throws, nothing above is undone
        shippingClient.schedule(order.getId());
    }
}

If inventoryClient.reserve() throws a 500, the payment is already gone. The order record is saved. The customer is calling support. This isn’t a service — it’s a liability with a nice method name.
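
To see the damage concretely, here is a minimal simulation of that flow. The class NaiveOrderFlow and its in-memory lists are hypothetical stand-ins for the three services, not the real clients above. The only goal is to show which side effects survive the exception.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the Orders, Payments, and Inventory services.
final class NaiveOrderFlow {
    final List<String> savedOrders = new ArrayList<>();
    final List<String> chargedOrders = new ArrayList<>();
    final List<String> reservedOrders = new ArrayList<>();
    boolean inventoryDown = false;

    void createOrder(String orderId) {
        savedOrders.add(orderId);    // step 1: order record is written
        chargedOrders.add(orderId);  // step 2: card is charged
        if (inventoryDown) {
            // step 3 fails, but steps 1 and 2 have already happened
            throw new IllegalStateException("inventory returned 500");
        }
        reservedOrders.add(orderId);
    }
}
```

With inventoryDown set to true, createOrder throws, yet the order stays in savedOrders and chargedOrders while reservedOrders is empty. That is exactly the 2 AM state from the introduction.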

Phase 2: The Over-Engineered Framework Trap

The mid-level developer discovered Design Patterns last week and decides to use all of them at once. The result is a generic framework built before the business problem is even solved.

public interface ISagaStep<T extends IBaseContext> {
    CompletableFuture<SagaResult> executeAsync(T context);
    CompletableFuture<Void> compensateAsync(T context);
}

public abstract class AbstractDistributedSagaManager<T extends SagaContext> {
    private final ITransactionCoordinatorStrategyFactory factory;
    private final List<ISagaStep<T>> steps = new ArrayList<>();
    // ...
}

They’ve built a generic Saga Framework before solving the actual business problem. It’s hard to test, impossible to debug, and the first time a network timeout occurs, the state machine gets stuck in PENDING_COMPENSATION_FINAL_RETRY_V2 indefinitely.


Building a Proper State-Machine-Based Saga

We need Compensating Transactions and a clear understanding of the Pivot Point. No magic, no generic frameworks — just deliberate design.

Step 1: Define the Saga State

A Saga is a long-running conversation. If you don’t track what was said, you can’t recover from a failure.

public enum SagaStatus {
    STARTED,
    PAYMENT_COMPLETED,
    INVENTORY_RESERVED,
    ORDER_COMPLETED,
    PAYMENT_FAILED,
    INVENTORY_FAILED,
    COMPENSATED
}

@Entity
public class OrderSagaState {
    @Id
    private UUID sagaId;
    private Long orderId;
    // Store the name, not the ordinal, so reordering the enum cannot corrupt saved rows.
    @Enumerated(EnumType.STRING)
    private SagaStatus status;
    private String lastError;
}
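
One way to keep the status field honest is to write down which transitions are legal and reject everything else. The sketch below is an assumption layered on the enum above: the SagaTransitions class and its transition table are illustrative, and the table follows the reserve, then charge, then ship order used in Step 2.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

enum SagaStatus {
    STARTED, PAYMENT_COMPLETED, INVENTORY_RESERVED, ORDER_COMPLETED,
    PAYMENT_FAILED, INVENTORY_FAILED, COMPENSATED
}

// Hypothetical helper: an explicit table of legal status transitions.
final class SagaTransitions {
    private static final Map<SagaStatus, Set<SagaStatus>> ALLOWED =
            new EnumMap<>(SagaStatus.class);

    static {
        ALLOWED.put(SagaStatus.STARTED,
                EnumSet.of(SagaStatus.INVENTORY_RESERVED, SagaStatus.INVENTORY_FAILED));
        // Before the pivot: move forward, record the failure, or compensate.
        ALLOWED.put(SagaStatus.INVENTORY_RESERVED,
                EnumSet.of(SagaStatus.PAYMENT_COMPLETED, SagaStatus.PAYMENT_FAILED,
                        SagaStatus.COMPENSATED));
        ALLOWED.put(SagaStatus.PAYMENT_FAILED, EnumSet.of(SagaStatus.COMPENSATED));
        // After the pivot the only way is forward.
        ALLOWED.put(SagaStatus.PAYMENT_COMPLETED, EnumSet.of(SagaStatus.ORDER_COMPLETED));
    }

    static boolean isAllowed(SagaStatus from, SagaStatus to) {
        Set<SagaStatus> targets = ALLOWED.get(from);
        return targets != null && targets.contains(to);
    }

    static void requireAllowed(SagaStatus from, SagaStatus to) {
        if (!isAllowed(from, to)) {
            throw new IllegalStateException("Illegal saga transition: " + from + " -> " + to);
        }
    }
}
```

Calling requireAllowed before every status write turns a confused orchestrator into a loud exception instead of a silently wrong row. Terminal states (ORDER_COMPLETED, INVENTORY_FAILED, COMPENSATED) have no entry in the table, so nothing can leave them.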

Step 2: The Orchestrator and the Pivot Point

Every Saga has three types of transactions:

  • Compensatable: Can be undone (e.g., releasing reserved inventory).
  • Pivot: The point of no return — usually the payment. Once past this, we must complete forward.
  • Retriable: Transactions after the Pivot that cannot fail permanently (e.g., scheduling shipping). If they fail, retry until success.

public void execute(OrderRequest request) {
    UUID sagaId = UUID.randomUUID();
    // Persist the saga before the first remote call so handleFailure can always find it
    // (assumes updateSagaStatus creates the row when it does not exist yet).
    updateSagaStatus(sagaId, SagaStatus.STARTED);
    try {
        // 1. Compensatable
        inventoryService.reserve(request.getSku());
        updateSagaStatus(sagaId, SagaStatus.INVENTORY_RESERVED);

        // 2. THE PIVOT (point of no return)
        paymentService.charge(request.getAmount());
        updateSagaStatus(sagaId, SagaStatus.PAYMENT_COMPLETED);

        // 3. Retriable
        shippingService.schedule(request.getOrderId());
        updateSagaStatus(sagaId, SagaStatus.ORDER_COMPLETED);
    } catch (Exception e) {
        handleFailure(sagaId, request);
    }
}
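
The "retry until success" rule for Retriable steps needs some mechanism behind it. Below is a minimal sketch of one, assuming a plain in-process loop with exponential backoff. RetriableStep, runWithRetry, and the 30-second cap are hypothetical names and values chosen for illustration. Note one deliberate deviation: the sketch gives up after maxAttempts and throws, on the assumption that a human or a durable queue takes over from there.

```java
// Hypothetical helper for post-pivot steps that must eventually succeed.
final class RetriableStep {
    private static final long MAX_BACKOFF_MILLIS = 30_000L;

    /** Runs the action until it succeeds; returns the attempt number that worked. */
    static int runWithRetry(Runnable action, int maxAttempts, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long backoff = initialBackoffMillis;
        RuntimeException lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                action.run();
                return attempt;
            } catch (RuntimeException e) {
                lastFailure = e;
            }
            if (attempt < maxAttempts) {
                sleep(backoff);
                // Double the wait each time, up to the cap.
                backoff = Math.min(backoff * 2, MAX_BACKOFF_MILLIS);
            }
        }
        // Out of attempts: surface the problem instead of looping forever.
        throw new IllegalStateException(
                "Step still failing after " + maxAttempts + " attempts", lastFailure);
    }

    private static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting to retry", e);
        }
    }
}
```

A call might look like RetriableStep.runWithRetry(() -> shippingService.schedule(orderId), 5, 200L). This only works if the step is idempotent, because a timeout can mean "it happened but you never heard back."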

Step 3: Compensating Transactions and Failure Logic

If failure happens before the Pivot, undo everything. If failure happens after the Pivot, retry — never refund.

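private void handleFailure(UUID sagaId, OrderRequest request) {
    // Assuming a Spring Data repository, findById returns an Optional.
    OrderSagaState state = repository.findById(sagaId).orElseThrow();
    if (state.getStatus() == SagaStatus.INVENTORY_RESERVED) {
        // Failed at the Pivot. Undo the inventory reservation.
        inventoryService.release(request.getSku());
        updateSagaStatus(sagaId, SagaStatus.COMPENSATED);
    } else if (state.getStatus() == SagaStatus.PAYMENT_COMPLETED) {
        // Past the pivot. Don't refund. Retry shipping.
        messageQueue.sendToRetry(new ShippingTask(request.getOrderId()));
    } else if (state.getStatus() == SagaStatus.STARTED) {
        // The reservation itself failed, so there is nothing to undo.
        updateSagaStatus(sagaId, SagaStatus.INVENTORY_FAILED);
    }
}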
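
With a single compensatable step, an if statement is enough. Once there are several, "undo everything" usually means undoing in reverse order. Here is a minimal sketch of that idea, assuming an in-memory stack. CompensationLog is a hypothetical class, and a production version would persist the pending compensations alongside the saga state so they survive a crash.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper: remembers how to undo each completed pre-pivot step.
final class CompensationLog {
    private final Deque<Runnable> undoActions = new ArrayDeque<>();

    /** Call right after a compensatable step succeeds. */
    void register(Runnable undo) {
        undoActions.push(undo);
    }

    /** Runs the undo actions newest-first and returns how many ran. */
    int compensateAll() {
        int ran = 0;
        while (!undoActions.isEmpty()) {
            undoActions.pop().run();
            ran++;
        }
        return ran;
    }

    /** Call once the pivot has succeeded: from here on the saga only moves forward. */
    void discard() {
        undoActions.clear();
    }
}
```

The orchestrator would register an undo after each successful step before the Pivot, call compensateAll() if the Pivot fails, and call discard() once the Pivot succeeds.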

Step 4: Semantic Locking for Dirty Data

Sagas are ACD, not ACID — we lose Isolation. While a Saga is running, another process might read intermediate state. Use a semantic lock to mark resources as in-progress.

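@Service
public class InventoryService {
    // Sketch only: a real reserve() should succeed only when the SKU is currently
    // AVAILABLE, otherwise two sagas can both "lock" the same item.
    public void reserve(String sku) {
        inventoryRepo.updateStatus(sku, "LOCKED_BY_SAGA");
    }

    public void release(String sku) {
        inventoryRepo.updateStatus(sku, "AVAILABLE");
    }
}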
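
A semantic lock only helps if acquiring it is conditional: the second saga to ask must be told no. The sketch below shows that check-and-set behavior with an in-memory map. SemanticLockTable and its method names are hypothetical, and in a real service the same rule would be a conditional UPDATE against the inventory table, not a ConcurrentHashMap.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory model of a per-SKU semantic lock.
final class SemanticLockTable {
    private final Map<String, String> ownerBySku = new ConcurrentHashMap<>();

    /** Returns true if this saga now holds the lock, or already held it. */
    boolean tryLock(String sku, String sagaId) {
        String current = ownerBySku.putIfAbsent(sku, sagaId);
        return current == null || current.equals(sagaId);
    }

    /** Only the owning saga can release; returns false for anyone else. */
    boolean release(String sku, String sagaId) {
        return ownerBySku.remove(sku, sagaId);
    }

    boolean isLocked(String sku) {
        return ownerBySku.containsKey(sku);
    }
}
```

Recording which saga owns the lock also makes tryLock safe to repeat, so a retried reserve command does not lock the saga out of its own SKU.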

Orchestration vs. Choreography: A Practical View

Choreography sounds appealing — each service listens to events and knows what to do next, no central coordinator. In practice, debugging a flow that spans six services with no central state is like playing Telephone with $100 bills.

For 90% of business cases, a central Orchestrator wins because it’s observable. You can point at a dashboard, see exactly where the order died, and know which compensation ran. That’s worth the coupling.

On the “overhead of persisting Saga state” concern: a few milliseconds of DB write for the state machine is cheaper than your lead developer spending 10 hours on a Saturday manually reconciling bank statements with database records. Do the math.


Why Teams Skip Sagas Until It’s Too Late

Two patterns drive this. The first is Resume-Driven Development: implementing a proper Saga with AWS Step Functions or Spring Statemachine feels “heavy” and “not agile,” so we choose the easy path of nested REST calls. We ship today and let the maintenance version of ourselves deal with the data corruption six months out.

The second is Cargo Cult Programming. We see Netflix using microservices so we do it too, without the infrastructure investment. Netflix has a thousand engineers to build Saga infrastructure. You have a Jira ticket and a deadline of Tuesday. When you move to microservices without a Saga strategy, you aren’t building a distributed system — you’re building a distributed liability.


Actionable Steps to Implement Sagas Correctly

  1. Identify your Pivot Point. Usually the action hardest to undo — the money movement.
  2. Define your Compensations first. If you can’t undo an action, it must happen after the Pivot and must be retriable.
  3. Don’t build your own framework. Use AWS Step Functions, Temporal, or Spring Statemachine. These tools handle the “what if the Orchestrator itself crashes?” problem, which you do not want to solve yourself.
  4. Enforce idempotency everywhere. Every service in a Saga must handle receiving the same command five times and execute it only once. If your refund() endpoint isn’t idempotent, you will have a very bad Saturday.
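
As a rough illustration of step 4, here is a minimal sketch of an idempotent refund handler keyed by an idempotency key. IdempotentRefundHandler, the key format, and the in-memory map are all assumptions made for the example. In a real service the key and the refund itself would need to be committed in the same database transaction.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical handler: the same refund command may arrive many times.
final class IdempotentRefundHandler {
    private final Map<String, Long> processedKeys = new ConcurrentHashMap<>();
    private final AtomicLong totalRefundedCents = new AtomicLong();

    /** Returns true if this call performed the refund, false if it was a duplicate. */
    boolean refund(String idempotencyKey, long amountCents) {
        // putIfAbsent atomically claims the key; only the first caller gets null back.
        Long previous = processedKeys.putIfAbsent(idempotencyKey, amountCents);
        if (previous != null) {
            return false;
        }
        totalRefundedCents.addAndGet(amountCents);
        return true;
    }

    long totalRefundedCents() {
        return totalRefundedCents.get();
    }
}
```

Sending the same key five times moves the money once. Deriving the key from the saga ID and the step name (for example "saga-42-refund") means retries from the orchestrator reuse it automatically.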

Distributed systems are hard because we pretend they’re just monoliths with network calls. They’re not. They’re chaotic environments where everything that can go wrong, will — usually at 2 AM on a Friday.

The Saga pattern isn’t about making things perfect. It’s about having a plan for when things burn. Now go delete that nested try-catch block and actually design your failure paths.

