Software architecture is the art of choosing your problems.
Every meaningful architectural decision is a trade-off. There are no
universal best practices — only choices that fit your constraints
better than the alternatives. This guide walks through the trade-offs
you'll meet again and again, so you can argue with your team using the
same vocabulary.
10 trade-offs · ~12 min read · Updated 2026
How to read this guide
Each section frames a single decision. You'll see the two (or three)
options, the forces pulling each way, and a short note on when each
choice tends to win. The right answer is almost always
"it depends" — the goal here is to make the dependencies
explicit.
Conway's Law: Any organization that designs a system
will produce a design whose structure is a copy of the organization's
communication structure. Architecture is downstream of people.
01
Monolith vs. Microservices
One deployable unit, or many small ones?
Monolith
A single codebase, a single deployment artifact, one shared database.
Simple to develop, test, and deploy locally.
Cross-cutting refactors are trivial — one repo, one type-checker.
No network between modules: latency is a function call.
One source of truth for transactions and consistency.
Scaling is all-or-nothing — you scale the whole process.
One bad deploy breaks every feature.
Coupling sneaks in unless you actively police boundaries.
Polyglot teams are hard: everyone shares one runtime.
Microservices
Many independent services, each owning its own data, talking over the network.
Teams ship independently — no coordinated release trains.
Scale only the services that need it.
Fault isolation: one service degrading doesn't kill the rest.
Pick the right tool per service (language, datastore).
Distributed systems are hard: partial failures, retries, idempotency.
Cross-service transactions become sagas.
Heavy ops investment: tracing, service discovery, contracts.
Local dev gets harder; "run it on your laptop" stops being true.
When to pick what: Start with a modular monolith. Split
out a service only when a clear scaling, ownership, or release-cadence
problem appears that the monolith demonstrably can't solve. Premature
microservices are the most expensive mistake on this list.
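To make "cross-service transactions become sagas" concrete, here is a minimal sketch of the pattern: each step pairs a forward action with a compensating action, and a failure triggers the compensations in reverse order. The `Step` type, `run_saga`, and the three order steps are hypothetical names for illustration; real steps would be network calls to separate services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    action: Callable[[], None]      # forward operation owned by one service
    compensate: Callable[[], None]  # undo for that operation


def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    done: List[Step] = []
    for step in steps:
        try:
            step.action()
            done.append(step)
        except Exception:
            for finished in reversed(done):
                finished.compensate()
            return False
    return True


log: List[str] = []


def fail_shipping() -> None:
    # Simulated partial failure in the third service.
    raise RuntimeError("shipping service unavailable")


order_saga = [
    Step("reserve_stock", lambda: log.append("stock reserved"), lambda: log.append("stock released")),
    Step("charge_card", lambda: log.append("card charged"), lambda: log.append("card refunded")),
    Step("book_shipping", fail_shipping, lambda: log.append("shipping cancelled")),
]

ok = run_saga(order_saga)
# ok is False; log is:
# ["stock reserved", "card charged", "card refunded", "stock released"]
```

In a monolith with one database, the same flow would be a single transaction with a rollback. The saga replaces that rollback with explicit undo logic you have to write and test.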
02
SQL vs. NoSQL
Relational rigor, or flexible scale?
Relational (SQL)
Strong schemas, ACID transactions, joins, decades of tooling.
Constraints catch bugs at the database layer.
Joins make new query patterns cheap to add later.
ACID transactions remove a whole class of consistency bugs.
Mature: Postgres and MySQL are battle-tested at almost any scale you're likely to see.
Horizontal scaling is genuinely hard (sharding, replicas).
Schema migrations are scary on huge tables.
Object-relational mismatch — ORMs are leaky.
Non-relational (NoSQL)
Document, key-value, wide-column, graph — pick the shape that fits the workload.
Scale-out is the default, not the escape hatch.
Schema-on-read fits rapidly evolving data, at least in document stores. (Wide-column and key-value stores still demand careful schema work.)
Specialized stores for specialized loads (timeseries, search, graph).
Weaker consistency guarantees by default.
Joins move to the application layer — query planning is your job.
Access patterns get baked into the data model: remodeling a Cassandra or DynamoDB schema after the fact is genuinely expensive.
Each NoSQL store is its own ecosystem; transferable knowledge is lower.
When to pick what: Default to Postgres. It scales
further than people think, and JSON columns give you NoSQL flexibility
when you need it. Reach for a specialized store when the workload is
clearly one shape (timeseries, full-text search, graph traversal) that
Postgres handles poorly.
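As a small illustration of "constraints catch bugs at the database layer", the sketch below uses Python's built-in `sqlite3` module as a stand-in for Postgres. The `accounts` and `transfers` tables and the `attempt` helper are made up for the example. The point is only that the database rejects bad writes regardless of what the application code forgot to check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL CHECK (balance >= 0)
    );
    CREATE TABLE transfers (
        id         INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES accounts(id),
        amount     INTEGER NOT NULL
    );
    INSERT INTO accounts (id, balance) VALUES (1, 100);
""")

rejected = []


def attempt(sql: str) -> None:
    """Run one statement in a transaction; record constraint violations."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql)
    except sqlite3.IntegrityError as exc:
        rejected.append(str(exc))


attempt("UPDATE accounts SET balance = balance - 150 WHERE id = 1")     # would go negative
attempt("INSERT INTO transfers (account_id, amount) VALUES (999, 10)")  # no such account

balance = conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone()[0]
# Both statements were rejected and the balance is still 100.
```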
03
Synchronous vs. Asynchronous Communication
Wait for the answer, or fire and forget?
Synchronous
Request, wait, get a response. HTTP, gRPC, direct DB calls.
Easy to reason about — read top to bottom.
Errors propagate naturally to the caller.
Simple debugging: one request, one trace.
Tight coupling: caller is blocked on callee's availability.
Cascading failures travel up the call stack.
Latency adds up across hops.
Asynchronous
Publish an event, move on. Queues, streams, pub/sub.
Producer and consumer scale independently.
Built-in buffer absorbs traffic spikes.
New consumers can be added without touching the producer.
Independent consumers fail independently — one crashing doesn't take the others down.
Eventual consistency: state is correct "soon," not now.
Debugging means following events across systems.
Ordered streams (Kafka partitions, SQS FIFO) introduce head-of-line blocking: one slow message holds up everything behind it in the same partition or message group.
Idempotency and ordering are now your problem. "Exactly-once delivery" is impossible across a network; what you actually build is at-least-once with idempotent processing.
When to pick what: Sync for read paths and anywhere
the user is waiting on the answer. Async for write fan-out, side
effects, and any work that can tolerate a few seconds of delay. Don't
introduce a queue just because it sounds modern — every queue is a
place data can get stuck.
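The "at-least-once with idempotent processing" idea from the asynchronous list can be sketched as a consumer that remembers which message IDs it has already applied. `IdempotentConsumer` and the message shape are assumptions for illustration. In a real system the seen-ID record and the state change would need to be committed atomically, for example in the same database transaction.

```python
from typing import Dict, List, Set, Tuple

Message = Tuple[str, str, int]  # (message_id, account, amount)


class IdempotentConsumer:
    def __init__(self) -> None:
        self.seen: Set[str] = set()         # IDs of messages already applied
        self.balances: Dict[str, int] = {}

    def handle(self, message: Message) -> bool:
        """Apply the message once; return False for a duplicate delivery."""
        message_id, account, amount = message
        if message_id in self.seen:
            return False
        self.balances[account] = self.balances.get(account, 0) + amount
        self.seen.add(message_id)
        return True


deliveries: List[Message] = [
    ("m1", "alice", 50),
    ("m2", "alice", 25),
    ("m1", "alice", 50),  # the broker redelivers m1 after a lost acknowledgement
]

consumer = IdempotentConsumer()
applied = [consumer.handle(m) for m in deliveries]
# applied is [True, True, False]; consumer.balances["alice"] is 75, not 125
```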
04
REST vs. GraphQL vs. gRPC
Three takes on talking to a server.
REST
Resources, HTTP verbs, JSON. The default for public APIs.
Universally understood, and GET responses are cacheable by standard HTTP infrastructure (browsers, proxies, CDNs).
Stateless, debuggable with curl.
Tooling is everywhere.
Over- and under-fetching: clients take what's served.
N+1 round-trips for nested resources.
Versioning is a discipline, not a feature.
GraphQL
One endpoint, clients describe exactly the data they want.
Clients fetch exactly the fields they need — no more, no less.
Strongly typed schema; great for varied client needs.
Single round-trip for deeply nested data.
Server complexity moves to resolvers and dataloaders.
HTTP caching takes effort — POST queries don't cache; persisted queries over GET (Apollo, Relay, Hasura) are how teams get CDN caching back.
Authorization at the field level is non-trivial.
gRPC
Binary protocol, schema-first, designed for service-to-service.
Compact binary wire format on HTTP/2 with multiplexing — many in-flight calls on one connection.
Bidirectional streaming as a first-class primitive, not a workaround.
Schema-first — generated clients across languages.
Built for internal RPCs at scale.
Browser support requires gRPC-Web — extra moving parts.
Harder to debug than plain HTTP+JSON.
Schema discipline is mandatory; breaking changes hurt.
When to pick what: REST for public-facing APIs and
anywhere HTTP caching matters. GraphQL when you have many client
shapes (mobile + web + partners) hitting overlapping data. gRPC for
latency-sensitive service-to-service traffic inside your network.
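The N+1 problem listed under REST, and the batching that GraphQL dataloaders perform on the server, can be shown with plain functions. This is a sketch with in-memory stand-ins: `fetch_author` and `fetch_authors` are hypothetical and represent one network or database round-trip each.

```python
from typing import Dict, List

AUTHORS: Dict[int, str] = {1: "Ada", 2: "Grace"}
POSTS = [
    {"id": 10, "author_id": 1},
    {"id": 11, "author_id": 2},
    {"id": 12, "author_id": 1},
]

calls = {"single": 0, "batch": 0}  # counts simulated round-trips


def fetch_author(author_id: int) -> str:
    calls["single"] += 1
    return AUTHORS[author_id]


def fetch_authors(author_ids: List[int]) -> Dict[int, str]:
    calls["batch"] += 1
    return {author_id: AUTHORS[author_id] for author_id in author_ids}


# Naive: one lookup per post, so N posts cost N extra round-trips.
naive = [{"id": p["id"], "author": fetch_author(p["author_id"])} for p in POSTS]

# Batched: collect the distinct keys first, then resolve them in one call.
unique_ids = sorted({p["author_id"] for p in POSTS})
by_id = fetch_authors(unique_ids)
batched = [{"id": p["id"], "author": by_id[p["author_id"]]} for p in POSTS]

# calls is {"single": 3, "batch": 1}; naive and batched hold the same data
```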
05
Consistency vs. Availability (CAP)
During a partition, which do you sacrifice?
The CAP theorem says a distributed system can guarantee at most two of
Consistency, Availability, and Partition tolerance. Since real networks
do partition, partition tolerance is not optional, so the real choice is between CP and AP.
CP — Consistency over Availability
If we can't agree, we refuse to answer.
Reads always reflect the latest write — no stale data.
Required for money, inventory, identity, anything serial.
During a partition, parts of the system go offline.
Higher latency: writes coordinate before they confirm.
Examples: Spanner, etcd, ZooKeeper, CockroachDB. (A single-node Postgres isn't really a CAP system at all; multi-node Postgres replication is configurable rather than canonically CP.)
AP — Availability over Consistency
Always answer. We'll reconcile later.
System stays up through partitions.
Lower write latency — no global coordination.
Reads can be stale or conflicting.
Conflict resolution becomes the application's job.
Examples: Cassandra, Riak, DynamoDB at default settings (it now also offers strongly-consistent reads and ACID transactions when you ask for them).
When to pick what: Pick CP when the cost of wrong is
higher than the cost of unavailable (money, medicine, auth). Pick AP
when the cost of unavailable is higher than the cost of slightly
stale (feeds, catalogs, social timelines).
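A deliberately tiny model of the choice: two replicas that both miss a write because of a partition. The `Replica` class and its `mode` flag are invented for this sketch and do not reflect how any of the listed databases work internally. It only shows the observable difference between refusing to answer and answering with stale data.

```python
from typing import Optional


class Replica:
    def __init__(self, mode: str) -> None:
        self.mode = mode          # "CP" or "AP"
        self.value = "v1"         # last value this replica knows about
        self.partitioned = False  # True when the replica cannot reach its peers

    def read(self) -> str:
        if self.partitioned and self.mode == "CP":
            # CP: cannot confirm this is the latest write, so refuse.
            raise RuntimeError("unavailable during partition")
        # AP (or healthy CP): answer with whatever is stored locally.
        return self.value


cp, ap = Replica("CP"), Replica("AP")

# A write of "v2" lands on the far side of a partition; neither replica sees it.
cp.partitioned = True
ap.partitioned = True

ap_answer = ap.read()  # "v1": available, but stale
try:
    cp_answer: Optional[str] = cp.read()
except RuntimeError:
    cp_answer = None   # consistent, but unavailable
```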
06
Vertical vs. Horizontal Scaling
Bigger machine, or more machines?
Vertical (scale up)
Throw more CPU, RAM, and disk at one box.
No code changes — same app, bigger machine.
No distributed-systems tax: still one process.
Easier ops: one host to monitor, patch, restart.
Hard ceiling: at some point the biggest box isn't big enough.
Single point of failure unless you also replicate.
Costs scale super-linearly at the top end.
Horizontal (scale out)
Add more, smaller, identical machines behind a load balancer.
Effectively unbounded — keep adding nodes.
Failure of one node is recoverable.
Cheaper per unit of compute at scale.
Application has to be stateless, or state has to be externalized.
Adds load balancers, service discovery, and orchestration.
Datastore is usually the next bottleneck.
When to pick what: Vertical first. It's astonishing
how far a single big Postgres + a single big app server can take you,
and you skip an entire category of complexity. Go horizontal when
vertical hits the wall — or when you need redundancy for availability.
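For the horizontal case, the "identical machines behind a load balancer" setup can be sketched as a round-robin picker that skips unhealthy nodes. `RoundRobinBalancer` and the node names are hypothetical. Real load balancers add health checks, connection draining, and weighting on top of this.

```python
from typing import List, Set


class RoundRobinBalancer:
    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        self.healthy: Set[str] = set(nodes)
        self._next = 0  # index of the next node to try

    def pick(self) -> str:
        """Return the next healthy node, skipping any marked unhealthy."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])
before = [balancer.pick() for _ in range(3)]
# before is ["app-1", "app-2", "app-3"]

balancer.healthy.discard("app-2")  # one node fails
after = [balancer.pick() for _ in range(4)]
# after is ["app-1", "app-3", "app-1", "app-3"]
```

This only works cleanly because any node can serve any request, which is the statelessness requirement from the list above.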
07
Server-Side vs. Client-Side Rendering
Where does the HTML get built?
Server-Side Rendering (SSR)
HTML built on the server, sent ready-to-display.
Fast first paint — content visible before JS runs.
Reliably indexed by every crawler. Modern Googlebot does run JS, but social previews, news bots, and TTFB-sensitive ranking still reward server-rendered HTML.
Works on weak devices and slow networks.
Every navigation hits the server.
Interactive features still need a client runtime.
Server has to do per-request rendering work.
Client-Side Rendering (CSR)
Server sends a JS bundle; browser builds the UI.
App-like experience after the initial load.
Server is just a JSON API — clean separation.
Cheap CDN delivery for the static bundle.
Slow first paint — blank screen until JS executes.
SEO needs extra work (prerender or hybrid).
Bundle size is a constant fight.
When to pick what: Mostly-static, content-heavy
sites: SSG (static site generation) — pre-render at build,
serve from a CDN. Dynamic pages with SEO needs: SSR,
ideally with streaming or React Server Components. Logged-in,
interaction-heavy apps where the user spends a long session:
CSR. Modern frameworks (Next.js, Remix, SvelteKit, Astro)
let you pick per route, which is usually the right answer.
08
Stateful vs. Stateless Services
Does the server remember you between requests?
Stateful
The server keeps session, connection, or in-memory state.
Lower latency — state is right there in memory.
Natural fit for long-lived connections (websockets, games).
Simpler protocol for some workloads.
Sticky routing required — load balancers get more complex.
Restarting a node loses or migrates state.
Scaling out means moving state, which is hard.
Stateless
Every request carries everything it needs. State lives elsewhere.
Any node can serve any request — trivial load balancing.
Restarts and rolling deploys are painless.
Horizontal scaling just works.
State has to live somewhere — usually the DB or cache.
More serialization, more network hops.
Token-based auth shifts complexity to clients.
When to pick what: Default to stateless application
servers. Push state into a datastore designed for it. Reach for
stateful when latency, connection model, or workload (real-time,
multiplayer, streaming) genuinely demands it.
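One common way to make "every request carries everything it needs" work is a signed token: the server signs the claims once, and any node holding the same key can verify them without a session lookup. The sketch below uses only the standard library. `SECRET`, `issue_token`, and `verify_token` are illustrative names, and a production token would also need an expiry and key rotation.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional

# Hypothetical shared key; in practice, load it from a secret store.
SECRET = b"demo-secret-shared-by-all-nodes"


def issue_token(claims: dict) -> str:
    """Encode the claims and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def verify_token(token: str) -> Optional[dict]:
    """Return the claims if the signature matches, otherwise None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))


token = issue_token({"user": "alice", "role": "admin"})

# Any node with SECRET can verify the token without shared session storage.
claims_on_other_node = verify_token(token)

# Flipping the last character of the signature makes verification fail.
tampered = token[:-1] + ("0" if token[-1] != "0" else "1")
forged = verify_token(tampered)
```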
09
Cache Strategies
Fast and stale, or fresh and slow?
Caching is the canonical performance lever, and the canonical source
of bugs. The trade-off isn't whether to cache — it's
where the staleness lands and who deals with it.
Cache-aside
App checks cache first; on miss, reads DB and populates cache.
Simple and explicit — the app is in control.
Cache failures are recoverable — fall through to DB.
Stampedes on cold cache without protection.
Invalidation is the app's job and easy to forget.
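A minimal cache-aside sketch, with plain dictionaries standing in for the database and the cache (`db`, `cache`, `get`, and `update` are illustrative names). It shows the two responsibilities the application takes on: populating the cache on a miss and invalidating it on a write.

```python
from typing import Dict, Optional

db: Dict[str, str] = {"user:1": "Ada"}  # stand-in for the database
cache: Dict[str, str] = {}              # stand-in for Redis or Memcached
stats = {"db_reads": 0}


def get(key: str) -> Optional[str]:
    """Check the cache first; on a miss, read the DB and populate the cache."""
    if key in cache:
        return cache[key]
    stats["db_reads"] += 1
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value


def update(key: str, value: str) -> None:
    """Write to the DB, then invalidate so the next read repopulates."""
    db[key] = value
    cache.pop(key, None)


first = get("user:1")   # miss: reads the DB
second = get("user:1")  # hit: served from the cache
update("user:1", "Grace")
third = get("user:1")   # miss again after invalidation
# stats["db_reads"] is 2 and third is "Grace"
```

Dropping the `cache.pop` line in `update` is the "easy to forget" bug: `third` would come back as the stale "Ada".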
Write-through
Writes go to cache and DB together.
Cache stays warm and in step with the DB for every key written through it.
Reads of written data are fast and fresh.
Slower writes — two systems on the hot path.
Caching everything you write may be wasteful.
Write-behind
Writes hit cache; cache flushes to DB asynchronously.
Very fast writes — DB is off the hot path.
Absorbs write spikes well.
Crash before flush = data loss.
Read-your-writes from elsewhere isn't guaranteed.
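The difference between write-through and write-behind comes down to when the database sees the write. The sketch below uses dictionaries as stand-ins for both stores. The class names and the explicit `flush` method are assumptions for illustration, since a real write-behind cache flushes on a timer or in a background worker.

```python
from typing import Dict, List, Tuple


class WriteThroughCache:
    def __init__(self, db: Dict[str, str]) -> None:
        self.db = db
        self.cache: Dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.db[key] = value     # DB write is on the hot path
        self.cache[key] = value


class WriteBehindCache:
    def __init__(self, db: Dict[str, str]) -> None:
        self.db = db
        self.cache: Dict[str, str] = {}
        self.pending: List[Tuple[str, str]] = []  # writes not yet in the DB

    def write(self, key: str, value: str) -> None:
        self.cache[key] = value  # returns without touching the DB
        self.pending.append((key, value))

    def flush(self) -> None:
        for key, value in self.pending:
            self.db[key] = value
        self.pending.clear()


through_db: Dict[str, str] = {}
behind_db: Dict[str, str] = {}
through = WriteThroughCache(through_db)
behind = WriteBehindCache(behind_db)

through.write("k", "v")
behind.write("k", "v")

# A crash at this point loses the write-behind value: it exists only in memory.
durable_before_flush = ("k" in through_db, "k" in behind_db)  # (True, False)

behind.flush()
durable_after_flush = "k" in behind_db  # True
```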
The two hard things: Phil Karlton's adage holds —
cache invalidation and naming things. Pick the simplest strategy you
can defend, and instrument hit/miss/staleness from day one. Beyond
the three above, read-through (cache loads from DB on miss,
not the app) and refresh-ahead (proactively reload hot keys
before they expire) are the standard variants for read-heavy
workloads.
10
Build vs. Buy
Your own infrastructure, or somebody else's?
Build
Implement and operate it yourself.
Total control over behavior, cost, and roadmap.
No vendor lock-in.
Can be a real competitive advantage if it's core to your product.
You own every outage, upgrade, and edge case.
Engineering time spent on plumbing isn't spent on product.
Slow to reach feature parity with mature vendors.
Buy
Use a SaaS, managed service, or off-the-shelf product.
Production-grade in a day, not a year.
Vendor's ops team handles uptime, scaling, security.
Engineering time stays on differentiating work.
Cost grows with usage, sometimes faster than expected.
Vendor decides the roadmap; you don't.
Migration off is painful — pick wisely.
When to pick what: Build what is your product. Buy
everything else, especially undifferentiated heavy lifting (auth,
email, observability, payments). The most expensive in-house systems
I've seen are bad copies of products you could license for less than
one engineer's monthly salary.
The meta-trade-off
Every choice on this page costs you something. The skill isn't picking
the option without downsides — there isn't one. It's picking the option
whose downsides you can afford, in this system, for
these users, at this stage of the company.
Architecture is a verb, not a noun. Re-evaluate every choice when the
constraints change.