A field guide

Software architecture is the art of choosing your problems.

Every meaningful architectural decision is a trade-off. There are no universal best practices — only choices that fit your constraints better than the alternatives. This guide walks through the trade-offs you'll meet again and again, so you can argue with your team using the same vocabulary.

10 trade-offs · ~12 min read · Updated 2026

How to read this guide

Each section frames a single decision. You'll see the two (or three) options, the forces pulling each way, and a short note on when each choice tends to win. The right answer is almost always "it depends" — the goal here is to make the dependencies explicit.

Conway's Law: Any organization that designs a system will produce a design whose structure is a copy of the organization's communication structure. Architecture is downstream of people.

Monolith vs. Microservices

One deployable unit, or many small ones?

Monolith

A single codebase, a single deployment artifact, one shared database.

Simple to develop, test, and deploy locally.
Cross-cutting refactors are trivial — one repo, one type-checker.
No network between modules: latency is a function call.
One source of truth for transactions and consistency.

Scaling is all-or-nothing — you scale the whole process.
One bad deploy breaks every feature.
Coupling sneaks in unless you actively police boundaries.
Polyglot teams are hard: everyone shares one runtime.

Microservices

Many independent services, each owning its own data, talking over the network.

Teams ship independently — no coordinated release trains.
Scale only the services that need it.
Fault isolation: one service degrading doesn't kill the rest.
Pick the right tool per service (language, datastore).

Distributed systems are hard: partial failures, retries, idempotency.
Cross-service transactions become sagas.
Heavy ops investment: tracing, service discovery, contracts.
Local dev gets harder; "run it on your laptop" stops being true.

When to pick what: Start with a modular monolith. Split out a service only when a clear scaling, ownership, or release-cadence problem appears that the monolith demonstrably can't solve. Premature microservices are the most expensive mistake on this list.

SQL vs. NoSQL

Relational rigor, or flexible scale?

Relational (SQL)

Strong schemas, ACID transactions, joins, decades of tooling.

Constraints catch bugs at the database layer.
Joins make new query patterns cheap to add later.
ACID transactions remove a whole class of consistency bugs.
Mature: Postgres and MySQL are battle-tested at scales most teams will never outgrow.

Horizontal scaling is genuinely hard (sharding, replicas).
Schema migrations are scary on huge tables.
Object-relational mismatch — ORMs are leaky.

Non-relational (NoSQL)

Document, key-value, wide-column, graph — pick the shape that fits the workload.

Scale-out is the default, not the escape hatch.
Schema-on-read fits rapidly evolving data — for document stores. (Wide-column and key-value still demand careful schema work.)
Specialized stores for specialized loads (timeseries, search, graph).

Weaker consistency guarantees by default.
Joins move to the application layer — query planning is your job.
Access patterns get baked into the data model: re-modeling on a Cassandra/DynamoDB schema after the fact is genuinely expensive.
Each NoSQL store is its own ecosystem; transferable knowledge is lower.
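The "joins move to the application layer" point is easier to see side by side. The sketch below uses Python's built-in `sqlite3` as a stand-in for a relational database and plain dicts as a stand-in for a key-value store; the `users` and `orders` data are made up for illustration.

```python
import sqlite3

# Hypothetical data: two users and their orders.
users = {1: {"name": "Ada"}, 2: {"name": "Lin"}}
orders = [
    {"id": 10, "user_id": 1, "total": 30},
    {"id": 11, "user_id": 1, "total": 12},
    {"id": 12, "user_id": 2, "total": 99},
]


def totals_via_sql() -> list:
    """Relational version: the database plans and executes the join."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES users(id), "
        "total INTEGER NOT NULL)"
    )
    db.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(uid, u["name"]) for uid, u in users.items()],
    )
    db.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(o["id"], o["user_id"], o["total"]) for o in orders],
    )
    rows = db.execute(
        "SELECT u.name, SUM(o.total) FROM users u "
        "JOIN orders o ON o.user_id = u.id "
        "GROUP BY u.id, u.name ORDER BY u.name"
    ).fetchall()
    db.close()
    return rows


def totals_via_app_join() -> list:
    """Key-value version: the application does the lookups and grouping."""
    totals = {}
    for order in orders:                          # scan one collection
        name = users[order["user_id"]]["name"]    # point lookup in the other
        totals[name] = totals.get(name, 0) + order["total"]
    return sorted(totals.items())
```

Both functions return the same per-user totals. The difference is who owns the query plan: in the second version, every new question about the data is new application code.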

When to pick what: Default to Postgres. It scales further than people think, and JSON columns give you NoSQL flexibility when you need it. Reach for a specialized store when the workload is clearly one shape (timeseries, full-text search, graph traversal) that Postgres handles poorly.

Synchronous vs. Asynchronous Communication

Wait for the answer, or fire and forget?

Synchronous

Request, wait, get a response. HTTP, gRPC, direct DB calls.

Easy to reason about — read top to bottom.
Errors propagate naturally to the caller.
Simple debugging: one request, one trace.

Tight coupling: caller is blocked on callee's availability.
Cascading failures travel up the call stack.
Latency adds up across hops.
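A rough back-of-the-envelope model for the costs above, assuming hops are called strictly in sequence and fail independently (real systems rarely fail that neatly, so treat the numbers as a first approximation). The function names are illustrative.

```python
import math
from typing import List


def chain_availability(per_hop: List[float]) -> float:
    """Availability of a synchronous chain: every hop must be up at once."""
    return math.prod(per_hop)


def chain_latency_ms(per_hop_ms: List[float]) -> float:
    """Latency of a sequential chain: each hop waits for the next one."""
    return sum(per_hop_ms)


# Three services at 99.9% each, called one after another.
three_hop_availability = chain_availability([0.999, 0.999, 0.999])
three_hop_latency = chain_latency_ms([20, 35, 50])
```

Three hops at 99.9% each give about 99.7% end to end, and the 20, 35, and 50 ms hops add up to 105 ms. Every extra synchronous dependency multiplies the first number down and adds to the second.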

Asynchronous

Publish an event, move on. Queues, streams, pub/sub.

Producer and consumer scale independently.
Built-in buffer absorbs traffic spikes.
New consumers can be added without touching the producer.
Independent consumers fail independently — one crashing doesn't take the others down.

Eventual consistency: state is correct "soon," not now.
Debugging means following events across systems.
Ordered streams (Kafka partitions, SQS FIFO) introduce head-of-line blocking — one slow message holds up everything behind it.
Idempotency and ordering are now your problem. "Exactly-once delivery" is impossible across a network; what you actually build is at-least-once with idempotent processing.
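A minimal sketch of "at-least-once with idempotent processing", assuming every message carries a unique `id` assigned by the producer. The `IdempotentConsumer` class and the message shape are hypothetical. In a real system the set of seen IDs would live in a durable store and be committed atomically with the side effect.

```python
class IdempotentConsumer:
    """Applies each message at most once, even if it is delivered twice."""

    def __init__(self) -> None:
        self.seen_ids = set()   # stand-in for a durable dedup table
        self.balance = 0

    def handle(self, message: dict) -> bool:
        """Process a message; return False if it was a duplicate."""
        if message["id"] in self.seen_ids:
            return False        # redelivery: acknowledge and skip
        self.balance += message["amount"]
        self.seen_ids.add(message["id"])
        return True


consumer = IdempotentConsumer()
deposit = {"id": "msg-1", "amount": 50}
first = consumer.handle(deposit)
second = consumer.handle(deposit)   # the broker retried the delivery
```

After both deliveries the balance is 50, not 100. The broker is free to retry as often as it likes because the consumer makes the retry harmless.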

When to pick what: Sync for read paths and anywhere the user is waiting on the answer. Async for write fan-out, side effects, and any work that can tolerate a few seconds of delay. Don't introduce a queue just because it sounds modern — every queue is a place data can get stuck.

REST vs. GraphQL vs. gRPC

Three takes on talking to a server.

REST

Resources, HTTP verbs, JSON. The default for public APIs.

Universally understood, and cacheable at every layer (browser, CDN, proxy) with standard HTTP semantics.
Stateless, debuggable with curl.
Tooling is everywhere.

Over- and under-fetching: clients take what's served.
N+1 round-trips for nested resources.
Versioning is a discipline, not a feature.

GraphQL

One endpoint, clients describe exactly the data they want.

Clients fetch exactly the fields they need — no more, no less.
Strongly typed schema; great for varied client needs.
Single round-trip for deeply nested data.

Server complexity moves to resolvers and dataloaders.
HTTP caching takes effort — POST queries don't cache; persisted queries over GET (Apollo, Relay, Hasura) are how teams get CDN caching back.
Authorization at the field level is non-trivial.
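The dataloader idea mentioned above can be sketched in a few lines. This is a simplified, synchronous illustration and not the API of any particular GraphQL library: production dataloaders usually batch per event-loop tick, while this hypothetical `BatchLoader` batches when the first value is read.

```python
from typing import Callable, Dict, List


class BatchLoader:
    """Collects keys, then resolves them with a single batched fetch."""

    def __init__(self, batch_fetch: Callable[[List[int]], Dict[int, str]]):
        self._batch_fetch = batch_fetch
        self._pending: List[int] = []
        self._cache: Dict[int, str] = {}

    def load(self, key: int) -> None:
        """Register interest in a key without fetching yet."""
        if key not in self._cache and key not in self._pending:
            self._pending.append(key)

    def get(self, key: int) -> str:
        """Return the value, flushing all pending keys in one backend call."""
        self.load(key)
        if self._pending:
            self._cache.update(self._batch_fetch(self._pending))
            self._pending = []
        return self._cache[key]


# Hypothetical backend: records every call so round-trips can be counted.
AUTHORS = {1: "Ada", 2: "Lin"}
backend_calls: List[List[int]] = []


def fetch_authors(ids: List[int]) -> Dict[int, str]:
    backend_calls.append(list(ids))
    return {i: AUTHORS[i] for i in ids}


def resolve_post_authors(author_ids: List[int]) -> List[str]:
    """Resolve the author of every post with one batched lookup."""
    loader = BatchLoader(fetch_authors)
    for author_id in author_ids:
        loader.load(author_id)
    return [loader.get(author_id) for author_id in author_ids]
```

Resolving the authors of three posts (two of them by the same author) costs one backend call with two IDs, where a naive resolver would make three calls. That saving is the reason the extra server complexity exists.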

gRPC

Binary protocol, schema-first, designed for service-to-service.

Compact binary wire format on HTTP/2 with multiplexing — many in-flight calls on one connection.
Bidirectional streaming as a first-class primitive, not a workaround.
Schema-first — generated clients across languages.
Built for internal RPCs at scale.

Browser support requires gRPC-Web — extra moving parts.
Harder to debug than plain HTTP+JSON.
Schema discipline is mandatory; breaking changes hurt.

When to pick what: REST for public-facing APIs and anywhere HTTP caching matters. GraphQL when you have many client shapes (mobile + web + partners) hitting overlapping data. gRPC for latency-sensitive service-to-service traffic inside your network.

Consistency vs. Availability (CAP)

During a partition, which do you sacrifice?

The CAP theorem says a distributed system can guarantee at most two of Consistency, Availability, and Partition tolerance. Networks will partition whether you plan for it or not, so partition tolerance is not optional and the real choice is between CP and AP.

CP — Consistency over Availability

If we can't agree, we refuse to answer.

Reads always reflect the latest write — no stale data.
Required for money, inventory, identity, and anything else where operations must apply in a strict order.

During a partition, parts of the system go offline.
Higher latency: writes coordinate before they confirm.

Examples: Spanner, etcd, ZooKeeper, CockroachDB. (A single-node Postgres isn't really a CAP system at all; multi-node Postgres replication is configurable rather than canonically CP.)

AP — Availability over Consistency

Always answer. We'll reconcile later.

System stays up through partitions.
Lower write latency — no global coordination.

Reads can be stale or conflicting.
Conflict resolution becomes the application's job.

Examples: Cassandra, Riak, DynamoDB at default settings (it now also offers strongly-consistent reads and ACID transactions when you ask for them).

When to pick what: Pick CP when the cost of wrong is higher than the cost of unavailable (money, medicine, auth). Pick AP when the cost of unavailable is higher than the cost of slightly stale (feeds, catalogs, social timelines).
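A toy model of the choice, assuming a fixed-size cluster where a CP node only accepts a write when it can reach a majority. The `Replica` class is hypothetical and leaves out everything a real system needs (replication, leader election, conflict resolution); it only shows where the two modes diverge during a partition.

```python
class Replica:
    """One node of a cluster, configured as either "CP" or "AP"."""

    def __init__(self, cluster_size: int, mode: str) -> None:
        self.cluster_size = cluster_size
        self.mode = mode
        self.reachable_peers = cluster_size - 1   # healthy network to start
        self.value = None

    def write(self, value) -> bool:
        """Accept or refuse a write based on mode and reachability."""
        nodes_in_touch = self.reachable_peers + 1   # peers plus this node
        has_majority = nodes_in_touch > self.cluster_size // 2
        if self.mode == "CP" and not has_majority:
            return False          # refuse rather than risk divergence
        self.value = value        # AP: accept locally, reconcile later
        return True


# A three-node cluster where this node has been cut off from both peers.
cp_node = Replica(cluster_size=3, mode="CP")
ap_node = Replica(cluster_size=3, mode="AP")
cp_node.reachable_peers = 0
ap_node.reachable_peers = 0

cp_accepted = cp_node.write("x=1")
ap_accepted = ap_node.write("x=1")
```

The isolated CP node refuses the write and stays unchanged; the isolated AP node accepts it and now holds a value its peers have not seen. Which of those outcomes is the bug depends entirely on the data.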

Vertical vs. Horizontal Scaling

Bigger machine, or more machines?

Vertical (scale up)

Throw more CPU, RAM, and disk at one box.

No code changes — same app, bigger machine.
No distributed-systems tax: still one process.
Easier ops: one host to monitor, patch, restart.

Hard ceiling: at some point the biggest box isn't big enough.
Single point of failure unless you also replicate.
Costs scale super-linearly at the top end.

Horizontal (scale out)

Add more, smaller, identical machines behind a load balancer.

Effectively unbounded — keep adding nodes.
Failure of one node is recoverable.
Cheaper per unit of compute at scale.

Application has to be stateless, or state has to be externalized.
Adds load balancers, service discovery, and orchestration.
Datastore is usually the next bottleneck.

When to pick what: Vertical first. It's astonishing how far a single big Postgres + a single big app server can take you, and you skip an entire category of complexity. Go horizontal when vertical hits the wall — or when you need redundancy for availability.

Server-Side vs. Client-Side Rendering

Where does the HTML get built?

Server-Side Rendering (SSR)

HTML built on the server, sent ready-to-display.

Fast first paint — content visible before JS runs.
Reliably indexed by every crawler. Modern Googlebot does run JS, but social previews, news bots, and TTFB-sensitive ranking still reward server-rendered HTML.
Works on weak devices and slow networks.

Every navigation hits the server.
Interactive features still need a client runtime.
Server has to do per-request rendering work.

Client-Side Rendering (CSR)

Server sends a JS bundle; browser builds the UI.

App-like experience after the initial load.
Server is just a JSON API — clean separation.
Cheap CDN delivery for the static bundle.

Slow first paint — blank screen until JS executes.
SEO needs extra work (prerender or hybrid).
Bundle size is a constant fight.

When to pick what: Mostly-static, content-heavy sites: SSG (static site generation) — pre-render at build, serve from a CDN. Dynamic pages with SEO needs: SSR, ideally with streaming or React Server Components. Logged-in, interaction-heavy apps where the user spends a long session: CSR. Modern frameworks (Next, Remix, SvelteKit, Astro) let you pick per route, which is usually the right answer.

Stateful vs. Stateless Services

Does the server remember you between requests?

Stateful

The server keeps session, connection, or in-memory state.

Lower latency — state is right there in memory.
Natural fit for long-lived connections (websockets, games).
Simpler protocol for some workloads.

Sticky routing required — load balancers get more complex.
Restarting a node loses or migrates state.
Scaling out means moving state, which is hard.

Stateless

Every request carries everything it needs. State lives elsewhere.

Any node can serve any request — trivial load balancing.
Restarts and rolling deploys are painless.
Horizontal scaling just works.

State has to live somewhere — usually the DB or cache.
More serialization, more network hops.
Token-based auth shifts complexity to clients.
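"Every request carries everything it needs" usually means a signed token. The sketch below is a deliberately simplified illustration using only the standard library, not a replacement for a vetted JWT library: the `SECRET` value, the claim names, and the token layout are assumptions, and expiry and key rotation are omitted.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

SECRET = b"demo-secret-change-me"   # hypothetical; load from config in practice


def issue_token(claims: dict, secret: bytes = SECRET) -> str:
    """Serialize the claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(
        json.dumps(claims, sort_keys=True).encode()
    )
    signature = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def verify_token(token: str, secret: bytes = SECRET) -> Optional[dict]:
    """Return the claims if the signature checks out, else None."""
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return None                 # no separator: not one of our tokens
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

Any node that knows the secret can verify the token without a session lookup, which is what makes load balancing trivial. The cost is the one listed above: revocation and refresh logic move out of the server's memory and into the protocol.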

When to pick what: Default to stateless application servers. Push state into a datastore designed for it. Reach for stateful when latency, connection model, or workload (real-time, multiplayer, streaming) genuinely demands it.

Cache Strategies

Fast and stale, or fresh and slow?

Caching is the canonical performance lever, and the canonical source of bugs. The trade-off isn't whether to cache — it's where the staleness lands and who deals with it.

Cache-aside

App checks cache first; on miss, reads DB and populates cache.

Simple and explicit — the app is in control.
Cache failures are recoverable — fall through to DB.

Stampedes on cold cache without protection.
Invalidation is the app's job and easy to forget.
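A minimal cache-aside sketch, with plain dicts standing in for both the database and the cache (in practice the cache would be something like Redis or Memcached). The `CacheAside` class and its `db_reads` counter are illustrative; TTLs and stampede protection are left out.

```python
class CacheAside:
    """The application owns both the cache lookup and the invalidation."""

    def __init__(self, db: dict) -> None:
        self.db = db          # stand-in for the database
        self.cache = {}       # stand-in for the cache
        self.db_reads = 0     # counts how often reads fall through

    def get(self, key):
        if key in self.cache:
            return self.cache[key]      # hit: the DB is never touched
        self.db_reads += 1              # miss: read the DB...
        value = self.db[key]
        self.cache[key] = value         # ...and populate the cache
        return value

    def update(self, key, value) -> None:
        self.db[key] = value
        # The step that is easy to forget: drop the stale entry.
        self.cache.pop(key, None)


store = CacheAside({"a": 1})
store.get("a")
store.get("a")
reads_before_update = store.db_reads
store.update("a", 2)
value_after_update = store.get("a")
```

Two reads cost one database hit, and the read after `update` returns the new value only because `update` invalidates the key. Delete that one line and the cache serves `1` indefinitely.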

Write-through

Writes go to cache and DB together.

Every written key is warm in the cache and consistent with the DB.
Reads are fast and reliably correct.

Slower writes — two systems on the hot path.
Caching everything you write may be wasteful.

Write-behind

Writes hit cache; cache flushes to DB asynchronously.

Very fast writes — DB is off the hot path.
Absorbs write spikes well.

Crash before flush = data loss.
Read-your-writes from elsewhere isn't guaranteed.
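The write-behind failure mode is easiest to see in a toy model. This sketch assumes an in-memory cache with an explicit `flush` call; the `WriteBehindCache` class and its `crash` method are hypothetical, and real implementations flush on a timer or queue.

```python
class WriteBehindCache:
    """Acknowledges writes from memory and persists them later."""

    def __init__(self, db: dict) -> None:
        self.db = db            # stand-in for the database
        self.cache = {}
        self.dirty = set()      # keys written but not yet persisted

    def write(self, key, value) -> None:
        self.cache[key] = value     # fast path: no DB round-trip
        self.dirty.add(key)

    def flush(self) -> None:
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()

    def crash(self) -> None:
        """Simulate losing the cache process before the next flush."""
        self.cache.clear()
        self.dirty.clear()


db = {}
cache = WriteBehindCache(db)
cache.write("a", 1)
cache.flush()           # "a" is now durable
cache.write("b", 2)     # acknowledged to the caller...
cache.crash()           # ...but never reaches the DB
```

After the crash the database holds only `a`. The write to `b` was acknowledged and then lost, which is the price of taking the DB off the hot path.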

The two hard things: Phil Karlton's adage holds. There are only two hard things in computer science: cache invalidation and naming things. Pick the simplest strategy you can defend, and instrument hit/miss/staleness from day one. Beyond the three above, read-through (the cache itself, not the app, loads from the DB on a miss) and refresh-ahead (hot keys are proactively reloaded before they expire) are the standard variants for read-heavy workloads.

Build vs. Buy

Your own infrastructure, or somebody else's?

Build

Implement and operate it yourself.

Total control over behavior, cost, and roadmap.
No vendor lock-in.
Can be a real competitive advantage if it's core to your product.

You own every outage, upgrade, and edge case.
Engineering time spent on plumbing isn't spent on product.
Slow to reach feature parity with mature vendors.

Buy

Use a SaaS, managed service, or off-the-shelf product.

Production-grade in a day, not a year.
Vendor's ops team handles uptime, scaling, security.
Engineering time stays on differentiating work.

Cost grows with usage, sometimes faster than expected.
Vendor decides the roadmap; you don't.
Migration off is painful — pick wisely.

When to pick what: Build what is your product. Buy everything else, especially undifferentiated heavy lifting (auth, email, observability, payments). The most expensive in-house systems I've seen are bad copies of products you could license for less than one engineer's monthly salary.

The meta-trade-off

Every choice on this page costs you something. The skill isn't picking the option without downsides — there isn't one. It's picking the option whose downsides you can afford, in this system, for these users, at this stage of the company.

Architecture is a verb, not a noun. Re-evaluate every choice when the constraints change.