A field guide

Patterns are answers — but only to specific questions.

These ten patterns show up over and over in distributed systems. None of them is universally good. Each one solves a particular problem, costs you something elsewhere, and only earns its keep when its problem actually exists in your system.

10 patterns · ~12 min read · Updated 2026

How to read this guide

Each pattern follows the same shape: the problem it answers, the structural shape of the solution, and the pitfalls that come with it. The "When it fits" line at the end is where the trade-off lives — the conditions under which the pattern's costs are worth paying.

Gall's Law: A complex system that works is invariably found to have evolved from a simple system that worked. Most patterns on this page are mid-game moves, not opening ones.

Anti-Corruption Layer

A translation layer between bounded contexts so foreign models don't leak.

Problem

You depend on a system whose model is messy, legacy, or owned by someone else. Calling it directly lets its concepts — naming, validation, even bugs — seep into your domain.

Shape

A thin layer sits between you and the foreign system. Inbound, it translates their model into yours; outbound, the reverse. Your domain only ever speaks its own language.
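As a minimal sketch of that two-way translation, assume a legacy customer record with fields named CUST_NO, EMAIL_ADDR, and STATUS_CD. Those field names, and the Customer type itself, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Our domain model: the only shape the rest of our code sees."""
    customer_id: str
    email: str
    active: bool


def from_legacy(record: dict) -> Customer:
    """Inbound: translate a (hypothetical) legacy record into our model."""
    return Customer(
        customer_id=str(record["CUST_NO"]),
        email=record["EMAIL_ADDR"].strip().lower(),
        active=record["STATUS_CD"] == "A",
    )


def to_legacy(customer: Customer) -> dict:
    """Outbound: translate our model back into the legacy record shape."""
    return {
        "CUST_NO": customer.customer_id,
        "EMAIL_ADDR": customer.email,
        "STATUS_CD": "A" if customer.active else "I",
    }
```

Note that the layer only renames and normalizes. Deciding what an inactive customer is allowed to do would be business logic, and that belongs in the domain.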

Watch for

It becomes a junk drawer if every dependency routes through one big ACL. Keep one per integration, and don't let business logic accumulate inside it.

When it fits: Migrating off a legacy system, integrating with a vendor whose API you don't control, or protecting a clean bounded context from messy upstream models.

Strangler Fig

Replace a legacy system one route at a time, never with a big-bang rewrite.

Problem

You need to retire a legacy system, but rewriting it in one shot is high-risk and slow to deliver value. Stopping mid-flight has to remain a safe option.

Shape

Put a façade in front of the legacy system. Route by route, redirect to the new implementation. Over time the façade routes everything to the new system, and the old one is removed.
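The routing decision inside the façade might look roughly like the sketch below. It assumes routes are matched by path prefix and that both systems can be called as plain functions; the Facade class and both handlers are hypothetical stand-ins, not a real proxy configuration.

```python
from typing import Callable

Handler = Callable[[str], str]


def legacy_handler(path: str) -> str:
    """Stand-in for a call to the legacy system."""
    return f"legacy:{path}"


def new_handler(path: str) -> str:
    """Stand-in for a call to the new implementation."""
    return f"new:{path}"


class Facade:
    """Sends migrated path prefixes to the new system, the rest to legacy."""

    def __init__(self, legacy: Handler, new: Handler) -> None:
        self._legacy = legacy
        self._new = new
        self._migrated: set[str] = set()

    def migrate(self, prefix: str) -> None:
        self._migrated.add(prefix)

    def rollback(self, prefix: str) -> None:
        # Stopping mid-flight stays safe: a route can go back to legacy.
        self._migrated.discard(prefix)

    def handle(self, path: str) -> str:
        if any(path.startswith(prefix) for prefix in self._migrated):
            return self._new(path)
        return self._legacy(path)
```

Once every prefix is in the migrated set and legacy traffic has dropped to zero, the legacy handler and the façade's fallback branch can both be deleted.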

Watch for

The intermediate state can last for years. Keep the kill date visible — and actually delete the old code when it's unused, or you live with two implementations forever.

When it fits: Risky rewrites where stopping mid-way must remain safe. Anywhere the legacy system serves real traffic that you can't afford to break for a weekend.

API Gateway / BFF

A single front door that aggregates services and shapes responses per client.

Problem

Multiple clients (web, mobile, third-party) want different shapes of the same data, and each backend service exposes its own raw API. Clients shouldn't need five round-trips to render a screen.

Shape

A gateway sits at the edge: it authenticates, rate-limits, fans out to internal services, and assembles a response. A BFF (Backend for Frontend) is a related but distinct pattern — a per-client backend (one for web, one for mobile) typically owned by the client team, sitting in place of (or behind) a shared gateway. The point of a BFF is team ownership and client-specific shape, not just splitting one gateway in two.
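The fan-out-and-assemble step can be sketched as follows, assuming two internal services that are stubbed here as coroutines. The function names and the response fields are hypothetical, and a real gateway would also handle authentication, timeouts, and partial failures.

```python
import asyncio


async def fetch_profile(user_id: str) -> dict:
    """Stub for a call to a hypothetical profile service."""
    return {"id": user_id, "name": "Ann", "internal_flags": ["beta"]}


async def fetch_orders(user_id: str) -> list:
    """Stub for a call to a hypothetical order service."""
    return [{"order_id": "o-1", "total": 30}, {"order_id": "o-2", "total": 12}]


async def mobile_home_screen(user_id: str) -> dict:
    """One client round-trip: fan out concurrently, then shape the result."""
    profile, orders = await asyncio.gather(
        fetch_profile(user_id),
        fetch_orders(user_id),
    )
    # Composition only: pick and rename fields, no domain rules.
    return {
        "name": profile["name"],
        "order_count": len(orders),
        "latest_order": orders[-1]["order_id"] if orders else None,
    }
```

A web BFF could reuse the same two calls and return a larger payload; the mobile shape here is deliberately small.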

Watch for

The gateway becomes a god service that owns everyone's logic. Keep it boring: routing, composition, cross-cutting concerns. No domain rules.

When it fits: Many backend services and many client types; or any time you want one place to enforce auth, rate limits, and observability at the edge.

Sidecar

Run a helper process beside the app to handle cross-cutting concerns.

Problem

Every service needs the same plumbing — TLS, metrics, log shipping, service discovery — and you don't want to reimplement it in every language and framework.

Shape

Deploy a second container alongside the app, sharing its pod and network namespace. On Kubernetes, "native sidecars" (init containers with restart policy Always) tie the sidecar's startup and shutdown to the main container; without that, lifecycle co-management is on you. The sidecar handles the cross-cutting concern; the app stays focused on business logic.

Watch for

Resource overhead per pod multiplies. Proxy sidecars also add a hop on the data path, so latency and failure modes grow with the number you bolt on. The pattern is in retreat for service-mesh data planes: Istio Ambient and Cilium's sidecarless mesh are direct responses to that overhead.

When it fits: Polyglot environments needing consistent behavior, service-mesh data planes, log and metric collection across many languages without a per-language SDK.

Circuit Breaker

Stop calling a failing dependency before it drags you down with it.

Problem

A downstream service starts timing out. Every caller piles up retries, threads block, your own service starves and falls over. The failure cascades up the call graph.

Shape

Wrap each remote call in a breaker. When the error rate crosses a threshold, the breaker opens: calls fail fast without hitting the dependency. After a cool-down it half-opens and lets a few probe requests through. If they succeed, it closes (back to normal); if any fail, it returns to open and the cool-down restarts.
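The state machine above can be sketched in a few lines. This version makes two simplifying assumptions that are not part of the pattern itself: it counts consecutive failures instead of measuring an error rate, and it lets a single probe through in the half-open state. The class and parameter names are illustrative, and the code is not thread-safe.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = 0.0
        self.state = "closed"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            if self._clock() - self._opened_at < self._cooldown:
                raise CircuitOpenError("dependency unavailable, failing fast")
            self.state = "half_open"  # cool-down elapsed: allow one probe
        try:
            result = fn()
        except Exception:
            self._on_failure()
            raise
        # Success in any state resets the breaker to normal operation.
        self._failures = 0
        self.state = "closed"
        return result

    def _on_failure(self) -> None:
        self._failures += 1
        if self.state == "half_open" or self._failures >= self._threshold:
            self.state = "open"
            self._opened_at = self._clock()  # restart the cool-down
```

Callers would catch CircuitOpenError and serve a fallback, such as a cached value or a degraded response, since the breaker itself only decides whether to attempt the call.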

Watch for

Wrong thresholds cause flapping. The breaker also needs a fallback strategy — failing fast is not the same as failing well. And it doesn't fix the broken dependency, only protects you from it.

When it fits: Any synchronous cross-service call where the downstream can fail independently. Especially valuable when timeouts are long enough to back up your own thread pool.

Bulkhead

Partition resources so one workload can't drown the others.

Problem

One slow tenant or one runaway feature exhausts the shared thread pool, connection pool, or queue — and every other request fails alongside it.

Shape

Give each workload its own slice of resources: separate thread pools, separate connection pools, separate queues. When one slice fills up, only that workload feels it.
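One lightweight way to sketch this is a concurrency limit per workload, using one semaphore per slice. The workload names and limits below are hypothetical, and a production setup would more likely use separate thread or connection pools than a shared helper like this.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a workload has used up its own slice."""


class Bulkhead:
    def __init__(self, limits: dict[str, int]) -> None:
        # One independent semaphore per workload: slices never share slots.
        self._slots = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    @contextmanager
    def slot(self, workload: str):
        semaphore = self._slots[workload]
        # Reject immediately instead of queueing behind a slow workload.
        if not semaphore.acquire(blocking=False):
            raise BulkheadFullError(workload)
        try:
            yield
        finally:
            semaphore.release()
```

With limits such as {"reports": 1, "checkout": 2}, a stuck report holds only the single reports slot, and checkout requests keep their two slots regardless.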

Watch for

Static partitions waste capacity: an idle slice can't lend its threads or connections to a busy one. You also need monitoring per slice. Otherwise you'll only learn the bulkhead saved you by reading logs after an incident that never reached the other workloads.

When it fits: Multi-tenant systems, mixed workloads (cheap reads sharing a service with expensive batch jobs), anywhere you need predictable behavior under partial overload.

Saga

Coordinate a distributed transaction with compensations instead of locks.

Problem

A business operation spans multiple services, each with its own database. Two-phase commit is rarely practical across heterogeneous services (XA support is uneven, blocking semantics are unacceptable at scale), but you still need an end state that's consistent.

Shape

Break the operation into local transactions, one per service. Each step records what it did. If a later step fails, run compensating actions (refund the charge, release the inventory) in reverse order to undo the steps that already completed.
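An orchestrated version of this can be sketched as a loop that remembers completed steps and compensates them in reverse order on failure. The Step type and run_saga function are illustrative names; a real orchestrator would also persist its progress so it can resume after a crash, and would retry compensations that fail.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], None]       # the local transaction in one service
    compensate: Callable[[], None]   # the action that semantically undoes it


def run_saga(steps: list[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: list[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            # The failed step did not complete, so only earlier steps
            # need compensating, most recent first.
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True
```

For a charge, reserve, ship sequence where shipping fails, this runs the inventory release first and the refund second.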

Watch for

Compensations aren't rollbacks: the world has moved on between steps, so a refund is a new action, not an erased charge. Sagas have no isolation between steps either: other transactions can read partial state ("dirty reads"), so you need semantic locks or commutative updates where it matters. Choreographed sagas (services react to events) are loosely coupled but hard to trace end to end; orchestrated sagas (a coordinator drives the steps) are easier to debug, at the cost of a central component.

When it fits: Multi-service workflows where eventual consistency is acceptable: order placement, account onboarding, anything spanning payment + inventory + fulfillment.

Outbox

Publish events reliably by writing them to your DB in the same transaction as the data.

Problem

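You update the database and publish an event as two separate steps. If the commit succeeds but the publish fails, your data and the rest of the world disagree. If the publish succeeds but the DB transaction rolls back, they disagree the other way: consumers have seen an event for a change that never happened.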

Shape

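Write the event to an "outbox" table inside the same transaction as the data change. A separate process polls the table (or tails the database's transaction log) and publishes the events to the broker, marking them sent. Because the data change and the event commit or roll back together, the database and the broker can lag but never permanently disagree.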

Watch for

Outbox tables grow without aggressive cleanup. Polling is simpler; CDC (log-tailing) is faster but couples you to the database's replication-log format (binlog, WAL, oplog). Outbox guarantees the event is published at least once — consumers must dedupe with an idempotency key, or the rest of the system will see ghost duplicates.

When it fits: Anywhere you publish events derived from DB writes and "lost message" or "ghost message" is unacceptable. Pairs naturally with sagas.

Event Sourcing

Store the log of changes; derive current state by replaying it.

Problem

The current state of a row tells you what's true now, but not how you got there. Audits, debugging, and "what did the user see at 3pm yesterday?" become guesswork.

Shape

Persist every change as an immutable event. Current state is a projection — fold the events forward whenever you need it. Snapshots speed up replay for long-lived aggregates.
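Folding events forward can be sketched with a plain reduce. The example assumes a toy account aggregate whose state is a single integer balance; the Deposited and Withdrawn event types are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Union


@dataclass(frozen=True)
class Deposited:
    amount: int


@dataclass(frozen=True)
class Withdrawn:
    amount: int


Event = Union[Deposited, Withdrawn]


def apply(balance: int, event: Event) -> int:
    """Pure transition function: previous state plus one event gives new state."""
    if isinstance(event, Deposited):
        return balance + event.amount
    if isinstance(event, Withdrawn):
        return balance - event.amount
    raise ValueError(f"unknown event type: {type(event).__name__}")


def replay(events: Iterable[Event], snapshot: int = 0) -> int:
    """Fold events onto a starting state (a snapshot, or empty state)."""
    return reduce(apply, events, snapshot)
```

Replaying only a prefix of the log answers the "what was true at that moment" question, and passing a stored snapshot with only the events recorded after it gives the same result as a full replay.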

Watch for

Schema evolution becomes versioning every event type forever. Replays slow down without snapshots. History is hard to undo: a wrong event needs a correcting event, not an edit — which collides with regulatory deletion rights (GDPR right-to-erasure), so plan a cryptographic-shredding or tombstone strategy up front.

When it fits: Domains where the log itself has business value — finance, audit-heavy systems, collaborative editing, anywhere "how did we get here?" is a real question.

CQRS

Split the read model from the write model so each can be optimized independently.

Problem

Reads and writes have wildly different shapes. Forcing both through one model means writes are slower than they need to be, reads do too many joins, and one schema serves neither well.

Shape

Two models. The write side accepts commands, validates them, and persists the change. The read side maintains denormalized projections optimized for queries, updated from the write side (often via events).
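The split can be sketched with an in-memory write model that emits events and a read-side projection that catches up from them. All names here are hypothetical, and the events list stands in for whatever transport connects the two sides.

```python
from collections import defaultdict


class WriteModel:
    """Accepts commands, validates them, and records the change."""

    def __init__(self) -> None:
        self._orders: dict[str, dict] = {}
        self.events: list[dict] = []  # stand-in for an event stream

    def place_order(self, order_id: str, customer: str, total: int) -> None:
        if total <= 0:
            raise ValueError("total must be positive")
        if order_id in self._orders:
            raise ValueError("duplicate order id")
        self._orders[order_id] = {"customer": customer, "total": total}
        self.events.append(
            {"type": "OrderPlaced", "customer": customer, "total": total}
        )


class SpendByCustomer:
    """Read-side projection: denormalized totals, shaped for one query."""

    def __init__(self) -> None:
        self._totals: defaultdict[str, int] = defaultdict(int)
        self._position = 0  # how far into the event stream we have read

    def catch_up(self, events: list[dict]) -> None:
        for event in events[self._position:]:
            if event["type"] == "OrderPlaced":
                self._totals[event["customer"]] += event["total"]
        self._position = len(events)

    def total_for(self, customer: str) -> int:
        return self._totals.get(customer, 0)
```

Until catch_up runs, total_for returns the old value. That gap is the eventual consistency the pattern asks you to accept.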

Watch for

Eventual consistency between write and read: a query issued right after a command may not see its effect yet. You have also doubled the moving parts (two schemas, two code paths, and an integration between them). Don't reach for it just because the term sounds clever.

When it fits: Read-heavy systems with complex query needs; pairs naturally with event sourcing; useful when read and write workloads need to scale independently.

Patterns aren't goals

Reach for a pattern when its problem is in front of you. Reaching for one because you read about it last week is how you end up with a saga coordinating a single-service operation, or an event log no one ever queries.

Most systems need fewer patterns than their engineers want to apply. The win is recognizing the question — not memorizing the answer.