Reliability is the discipline of designing systems that fail
predictably and recover gracefully. These ten practices bridge
architecture and operations — they shape what your team does
before, during, and after the inevitable.
10 practices · ~12 min read · Updated 2026
How to read this guide
Each practice follows the same shape: the problem it answers, the
operational pattern, and the failure modes when it's done badly.
The "When it fits" line at the end tells you when the practice
starts paying for itself.
Murphy's corollary: Everything that can fail will
fail, and the failures that matter most are the ones you didn't
plan for. The practices below are how teams convert surprises into
rehearsals.
01
SLO / SLI / SLA
Service-Level Indicators, Objectives, and Agreements — measuring reliability deliberately.
Problem
"Reliable" without numbers becomes whatever the on-call engineer says it is. Without targets, you over-invest in some areas and under-invest in others.
Shape
SLIs are what you measure (request latency, error rate). SLOs are targets over a window, for example "99.9% of requests under 200 ms over a rolling 28-day window"; the gap between 100% and the SLO is your error budget. SLAs are the contractual promise to customers, usually looser than the SLO and with financial or legal consequences when missed.
Watch for
100% is the wrong target. Error budgets make the trade-off explicit — when you're in budget, ship fast; when you're not, slow down. The standard math: 99% ≈ 3.65 days/year of downtime, 99.9% ≈ 8.77 hours, 99.99% ≈ 52.6 minutes, 99.999% ≈ 5.26 minutes. Alert on burn rate (multi-window, multi-burn-rate) so you catch fast-burn outages and slow leaks with one ruleset.
When it fits:
Any production system with users who notice when it breaks. Mature
teams use SLOs to drive engineering priorities, not just measure them.
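The error-budget arithmetic and the multi-window burn-rate idea above can be sketched in a few lines. This is a minimal illustration, not a monitoring product's API: the function names, the 365.25-day year, and the paging threshold are all assumptions made for the example.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # assumption: average year length


def allowed_downtime_minutes(slo: float, window_minutes: float = MINUTES_PER_YEAR) -> float:
    """Error budget expressed as minutes of full downtime within the window."""
    return (1.0 - slo) * window_minutes


def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return observed_error_ratio / (1.0 - slo)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float) -> bool:
    """Multi-window check: page only if BOTH windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)
```

With a 99.9% SLO, an observed error ratio of 1% is a burn rate of roughly 10: at that pace the budget is gone in about a tenth of the window. Running the same check with several threshold and window pairs is what turns it into a multi-burn-rate ruleset.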
02
Deployment Strategies
Blue-green, canary, rolling — and when each saves your weekend.
Problem
Big-bang deploys mean a bad release affects every user at once. Rollback under pressure is when most outages get worse.
Shape
Blue-green keeps two full environments and switches traffic. Canary releases to a small slice of users first. Rolling updates instances gradually. Each gives you a different cancel-the-deploy moment.
Watch for
Schema changes break blue-green if the two versions can't share the database. Canary needs real metrics to detect bad releases. Rolling can mask bugs that only show up at scale.
When it fits:
Any system with users. Pick the strategy that matches your traffic
shape, your data model, and your tolerance for partial-state during deploys.
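Canary only helps if something compares the canary against the baseline and makes a call. The sketch below shows one possible gate; `Cohort`, `canary_verdict`, and every threshold in it are hypothetical names and values chosen for illustration, and a real gate would usually look at latency and saturation as well as errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Cohort, canary: Cohort,
                   min_requests: int = 500,
                   max_ratio: float = 2.0,
                   floor: float = 0.001) -> str:
    """Return "wait", "promote", or "rollback" for a canary release.

    - "wait": the canary has not seen enough traffic to judge.
    - "rollback": the canary error rate exceeds the allowed rate, which is
      the larger of (baseline rate * max_ratio) and a small absolute floor.
      The floor stops a near-zero baseline from failing on a single error.
    """
    if canary.requests < min_requests:
        return "wait"
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "rollback" if canary.error_rate > allowed else "promote"
```

The "wait" outcome matters as much as the other two: a verdict on too little traffic is how bad releases get promoted and good ones get rolled back.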
03
Feature Flags
Decouple deploy from release; turn features off without redeploying.
Problem
Releasing a feature is risky; rolling it back requires another deploy. Long-lived branches diverge from main and become merge nightmares.
Shape
Wrap new behavior in a flag. Deploy with the flag off. Turn it on for internal users, then a small percentage, then everyone. Rollback is a config change, not a redeploy.
Watch for
Flags accumulate. A codebase with hundreds of stale flags is its own kind of debt. Set expiration dates and clean them up.
When it fits:
Any non-trivial change to user-facing behavior. Especially valuable
for migrations, experiments, and anything that might need to be
turned off at 3am.
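The staged rollout described above (internal users, then a percentage, then everyone) can be sketched with deterministic hashing. This is an illustration under assumptions: `bucket` and `is_enabled` are hypothetical helpers, and a real flag service would load `rollout_percent` from configuration rather than take it as an argument.

```python
import hashlib
from typing import AbstractSet


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               internal_users: AbstractSet[str] = frozenset()) -> bool:
    """Decide whether a user sees the flagged behavior.

    Internal users always get the feature; everyone else gets it when
    their bucket falls below the rollout percentage.
    """
    if user_id in internal_users:
        return True
    return bucket(flag, user_id) < rollout_percent
```

Because the bucket depends only on the flag name and the user ID, raising the percentage from 10 to 50 keeps the original 10% enabled, and setting it to 0 with an empty allowlist is the config-only rollback.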
04
Chaos Engineering
Break things on purpose to find what would break by accident.
Problem
Failures you haven't seen are failures you don't handle. By the time production teaches you, customers are paying the price.
Shape
State a steady-state hypothesis (the metric you expect to stay flat), then inject controlled failures in production-like environments — kill instances, slow networks, exhaust disks, drop messages. The discipline is the experiment, not the breakage: hypothesis, blast radius, abort criteria, observed result. Fix the surprises.
Watch for
Chaos in production needs strong observability and circuit breakers, or it becomes a self-inflicted outage. Start in staging; graduate to a limited production blast radius only when you're ready.
When it fits:
Distributed systems where failure modes are too numerous to enumerate.
Mature SRE organizations; high-availability systems where untested
failure paths are unacceptable.
05
Incident Response
Problem
Outages happen. Without a process, the response is slow, the wrong people get paged, and the same root cause hits twice.
Shape
On-call rotation with clear handoff. Severity tiers (SEV1/2/3) so paging policy matches actual user impact. Playbooks for known scenarios. Incident commander role to coordinate. Track MTTD (detect), MTTR (resolve / restore), and MTBF (between failures) — they tell you whether you're getting better at detection, response, or prevention. Blameless postmortem afterward — focus on systems and process, not people.
Watch for
Postmortems become finger-pointing if blame culture exists. Action items pile up if no one owns them. The same incident recurring is a process failure, not bad luck.
When it fits:
Any system that runs 24/7 with consequences when it fails. The
discipline scales from a 2-person team to thousands.
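MTTD, MTTR, and MTBF are easy to compute once incidents carry three timestamps. The sketch below is one way to do it; the `Incident` record is hypothetical, and the definitions used here (MTTR measured from the start of impact, MTBF measured from one resolution to the next start) are assumptions, since teams define these differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, Sequence


@dataclass(frozen=True)
class Incident:
    started: datetime    # when user impact began
    detected: datetime   # when an alert or a human noticed
    resolved: datetime   # when service was restored


def _mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    # statistics.mean raises StatisticsError on empty input.
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to detect: impact start to detection."""
    return _mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents: Sequence[Incident]) -> timedelta:
    """Mean time to restore: impact start to resolution (an assumption)."""
    return _mean_delta(i.resolved - i.started for i in incidents)


def mtbf(incidents: Sequence[Incident]) -> timedelta:
    """Mean healthy time between failures; needs at least two incidents."""
    ordered = sorted(incidents, key=lambda i: i.started)
    return _mean_delta(nxt.started - prev.resolved
                       for prev, nxt in zip(ordered, ordered[1:]))
```

Whichever definitions you choose, write them down and keep them fixed; the trend over time is what tells you whether detection, response, or prevention is improving.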
06
Capacity Planning
Projecting load and provisioning headroom before you need it.
Problem
Capacity surprises are the most expensive kind. Scaling up takes time, and trying to add capacity while the system is already overloaded is usually too late.
Shape
Track current usage and project growth (linear, seasonal, event-driven). Provision for projected peak plus margin. Load-test against the next 12 months of growth, not yesterday's traffic.
Watch for
Cloud autoscaling makes this feel automatic until it isn't — quotas, instance availability, and database limits don't elastically scale. Know your real ceiling, not the marketing ceiling.
When it fits:
Any system serving variable load. Especially critical before
launches, marketing campaigns, seasonal peaks, and migrations.
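The "projected peak plus margin" rule can be written down as arithmetic. This is a rough sketch with assumed names and an assumed compound-growth model; substitute a linear or event-driven model if that matches your traffic better.

```python
import math


def projected_peak(current_peak: float, monthly_growth: float, months: int,
                   seasonal_multiplier: float = 1.0) -> float:
    """Project peak load, assuming compound monthly growth.

    seasonal_multiplier scales the result for a known seasonal or
    event-driven spike (for example 1.5 for a holiday peak).
    """
    return current_peak * (1.0 + monthly_growth) ** months * seasonal_multiplier


def required_capacity(peak: float, headroom: float = 0.3) -> float:
    """Add a safety margin on top of the projected peak."""
    return peak * (1.0 + headroom)


def instances_needed(capacity: float, per_instance: float) -> int:
    """Round up: a fraction of an instance still means one more instance."""
    return math.ceil(capacity / per_instance)
```

For example, 1,000 requests per second growing 5% per month projects to about 1,796 after 12 months. Feed that number into your load tests, then check it against quotas and database limits rather than only against instance counts.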
07
Backpressure & Rate Limiting
Fail predictably under overload instead of failing chaotically.
Problem
When demand exceeds capacity, queues grow unbounded, latency spirals, and the system collapses under its own retries. Without backpressure, overload becomes outage.
Shape
Three layers, related but distinct. Timeouts and deadlines are the foundation — bound every blocking call, propagate deadlines across services, never wait forever. Backpressure is flow control between stages: bounded queues, reactive-streams credit, TCP windowing — slow consumers slow producers instead of dropping work invisibly. Rate limiting rejects excess load at the edge. Pair with retries with exponential backoff and jitter to avoid retry storms. Circuit breakers and bulkheads sit alongside these to isolate failure, not as substitutes.
Watch for
Rate limits that are too tight throttle legitimate users. Backpressure that's too aggressive starves valid work. Tune against real traffic, not synthetic load.
When it fits:
Any system with shared resources and variable load. Especially
valuable for APIs, queues, and any service whose SLA is tighter
than its dependencies'.
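Two of the building blocks named above, rate limiting at the edge and retries with exponential backoff and jitter, fit in a short sketch. The token bucket and the "full jitter" formula are common techniques, but the class and function names, defaults, and injectable clock here are assumptions made for the example.

```python
import random
import time
from typing import Callable, Optional


class TokenBucket:
    """Token-bucket rate limiter: `rate` tokens per second, burst up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock          # injectable so tests can control time
        self._tokens = capacity      # start full
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False                 # caller should reject, e.g. with HTTP 429


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))
```

For the backpressure layer between stages, the standard library's `queue.Queue(maxsize=n)` gives a bounded queue whose `put` blocks when the queue is full, so a slow consumer slows the producer instead of letting memory grow.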
08
Observability Strategy
Logs, metrics, and traces designed in, not bolted on.
Problem
You can only debug what you can see. Outages without observability are guesswork. Adding observability after an incident is too late.
Shape
Three pillars. Logs (events with context). Metrics (numbers over time). Traces (request paths across services). Add structured logging from day one; metrics for SLO measurement; traces for cross-service investigation.
Watch for
Cardinality explosion is the silent killer of metrics systems — be careful with high-cardinality labels. Logs without correlation IDs are searchable but not navigable.
When it fits:
Every system. The cost of adding observability after an outage is
the cost of one more outage with the same root cause.
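Structured logs with a correlation ID are the cheapest of these to add on day one. A minimal sketch using only the Python standard library follows; the field names and the `new_correlation_id` helper are assumptions, and a production formatter would also handle exception info and extra fields.

```python
import contextvars
import json
import logging
import uuid

# Holds the ID of the request currently being handled in this context.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)


def new_correlation_id() -> str:
    """Generate an ID at the edge of the system and bind it to the current context."""
    cid = uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


class JsonFormatter(logging.Formatter):
    """Emit each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(payload)
```

Attach the formatter with `handler.setFormatter(JsonFormatter())`, and pass the same ID to downstream services in a request header so logs and traces from different services can be joined on it.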
09
Health Checks & Probes
Liveness, readiness, and startup checks done right.
Problem
A service that's running isn't necessarily serving. Without checks, load balancers send traffic to broken instances; orchestrators don't restart hung pods.
Shape
Liveness: am I alive (or should I be killed)? Readiness: am I ready for traffic? Startup: am I done initializing? Each answers a different question and routes a different action.
Watch for
A liveness check that depends on dependencies will mass-restart instances during a dependency outage, which makes the outage worse. Keep liveness self-contained; let readiness check dependencies.
When it fits:
Anything running on Kubernetes, ECS, or behind a load balancer. The
check design is part of the deployment, not an afterthought.
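The split between the three checks is easier to see in code. The sketch below is framework-free and hypothetical: `HealthState`, the `deadlocked` watchdog flag, and the status tuples are assumptions, and in practice each method would back an HTTP endpoint that the orchestrator or load balancer polls.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


def _passes(check: Callable[[], bool]) -> bool:
    """Treat an exception from a dependency check as a failed check."""
    try:
        return bool(check())
    except Exception:
        return False


@dataclass
class HealthState:
    started: bool = False       # set to True once initialization finishes
    deadlocked: bool = False    # hypothetical flag set by an internal watchdog
    dependency_checks: Dict[str, Callable[[], bool]] = field(default_factory=dict)

    def startup(self) -> Tuple[int, str]:
        """Am I done initializing?"""
        return (200, "started") if self.started else (503, "initializing")

    def liveness(self) -> Tuple[int, str]:
        """Am I alive? Self-contained: never calls dependencies."""
        return (503, "deadlocked") if self.deadlocked else (200, "alive")

    def readiness(self) -> Tuple[int, str]:
        """Am I ready for traffic? This is where dependencies are checked."""
        if not self.started:
            return (503, "initializing")
        failing = sorted(name for name, check in self.dependency_checks.items()
                         if not _passes(check))
        if failing:
            return (503, "not ready: " + ", ".join(failing))
        return (200, "ready")
```

When the database is down, readiness fails and traffic is routed away, but liveness still passes, so the instance is not restarted for a problem that a restart cannot fix.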
10
Graceful Degradation
Keep working when parts fail; partial service beats no service.
Problem
A single dependency failure cascades into a full outage. Users see a broken homepage because the recommendations service is slow.
Shape
Identify which features are essential (must work) vs nice-to-have (degrade if dependency is down). Use timeouts, fallbacks, and cached values to keep the core working — and remember stale data has its own correctness cost (stale prices, stale auth), so bound the staleness deliberately. Show users a degraded experience, not an error page.
Watch for
Silent degradation hides real failures. Users may not notice the missing feature, but operators need to. Degradation should be observable and alert-able.
When it fits:
Any user-facing system with multiple backend dependencies. Anywhere
the cost of a partial outage is much less than a full one.
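A fallback with bounded staleness and a visible "degraded" signal can be sketched as a small wrapper. Everything here is illustrative: `StaleCacheFallback` and `Result` are hypothetical names, and the wrapped `fetch` callable is assumed to enforce its own timeout and raise an exception on failure.

```python
import time
from dataclasses import dataclass
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


@dataclass
class Result(Generic[T]):
    value: T
    degraded: bool   # True when the value did not come from a fresh fetch


class StaleCacheFallback(Generic[T]):
    """Serve fresh data when possible, recent cached data on failure, else a default."""

    def __init__(self, fetch: Callable[[], T], default: T, max_stale_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._default = default
        self._max_stale = max_stale_seconds
        self._clock = clock
        self._cached: Optional[T] = None
        self._cached_at: Optional[float] = None

    def get(self) -> Result[T]:
        try:
            value = self._fetch()   # assumed to time out and raise on its own
        except Exception:
            now = self._clock()
            fresh_enough = (self._cached_at is not None
                            and now - self._cached_at <= self._max_stale)
            if fresh_enough:
                return Result(self._cached, degraded=True)
            # Cache is missing or too old: fall back to the safe default.
            return Result(self._default, degraded=True)
        self._cached, self._cached_at = value, self._clock()
        return Result(value, degraded=False)
```

Count every `degraded=True` result in a metric and alert on it, so the fallback keeps users served without hiding the failure from operators. For data like prices or authorization, set `max_stale_seconds` low or skip the cache path entirely.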
Reliability is invisible until it isn't
These ten practices are the difference between teams that get
called at 3am and teams that get called at 3am once. The investment
looks like overhead until you're inside an outage and the playbook
is the only reason it stays a small one.
Build the practice before you need it. The teams that do are the
ones whose services your customers don't notice — and that's the
whole point.