A field guide

Hope is not a strategy.

Reliability is the discipline of designing systems that fail predictably and recover gracefully. These ten practices bridge architecture and operations — they shape what your team does before, during, and after the inevitable.

10 practices · ~12 min read · Updated 2026

How to read this guide

Each practice follows the same shape: the problem it answers, the operational pattern, and the failure modes when it's done badly. The "When it fits" line at the end says when the practice starts paying for itself.

Murphy's corollary: Everything that can fail will fail, and the failures that matter most are the ones you didn't plan for. The practices below are how teams convert surprises into rehearsals.

SLO / SLI / SLA

Service-Level Indicators, Objectives, and Agreements — measuring reliability deliberately.

Problem

"Reliable" without numbers becomes whatever the on-call engineer says it is. Without targets, you over-invest in some areas and under-invest in others.

Shape

SLIs are what you measure (request latency, error rate). SLOs are targets for an SLI over a window, for example "99.9% of requests under 200ms over a rolling 28 days"; the gap between 100% and the SLO is your error budget. SLAs are the contractual promise to customers, usually looser than the SLO, with financial or legal consequences when missed.
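To make the error budget concrete, here is a minimal sketch that turns an SLO target and a request count into a budget of "bad" requests. The function names are illustrative, and counting requests (rather than time) is an assumption about how the SLI is defined.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to miss the objective within the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the budget still unspent; a negative value means the SLO is missed."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        raise ValueError("window too small to carry an error budget")
    return 1.0 - bad_requests / budget
```

With a 99.9% target and 10,000,000 requests in the window, the budget is 10,000 bad requests; 2,500 bad requests so far leaves 75% of it.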

Watch for

100% is the wrong target. Error budgets make the trade-off explicit — when you're in budget, ship fast; when you're not, slow down. The standard math: 99% ≈ 3.65 days/year of downtime, 99.9% ≈ 8.77 hours, 99.99% ≈ 52.6 minutes, 99.999% ≈ 5.26 minutes. Alert on burn rate (multi-window, multi-burn-rate) so you catch fast-burn outages and slow leaks with one ruleset.
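The downtime figures and the burn-rate idea can be sketched in a few lines. This assumes a 365.25-day year and a two-window check; the 14.4 threshold is an illustrative value for a fast-burn page, not a recommendation for your system.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def allowed_downtime_minutes(availability: float) -> float:
    """Yearly downtime permitted by an availability target such as 0.999."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window shows
    it is still happening, so the alert clears soon after recovery.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

A slow leak is caught by running the same check with longer windows and a lower threshold, which is what "multi-window, multi-burn-rate" refers to.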

When it fits: Any production system with users who notice when it breaks. Mature teams use SLOs to drive engineering priorities, not just to report on reliability.

Deployment Strategies

Blue-green, canary, rolling — and when each saves your weekend.

Problem

Big-bang deploys mean a bad release affects every user at once. A rollback improvised under pressure is where many outages get worse.

Shape

Blue-green keeps two full environments and switches traffic between them. Canary releases to a small slice of users first. A rolling deploy replaces instances a few at a time. Each gives you a different cancel-the-deploy moment.

Watch for

Schema changes break blue-green if the two versions can't share the database. Canary needs real metrics to detect bad releases. Rolling can mask bugs that only show up at scale.
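What "real metrics" means for a canary can be sketched as a gate that compares the canary's error rate with the baseline's. The thresholds (`min_requests`, `max_ratio`, `floor`) and the three verdict strings are hypothetical; a real gate would usually look at latency and saturation as well.

```python
from dataclasses import dataclass


@dataclass
class Cohort:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Cohort, canary: Cohort,
                   min_requests: int = 500,
                   max_ratio: float = 2.0,
                   floor: float = 0.001) -> str:
    """Return 'wait', 'promote', or 'rollback' for the canary cohort."""
    # Too little traffic: any comparison would be noise.
    if canary.requests < min_requests:
        return "wait"
    # Allow some multiple of the baseline error rate, with a floor so a
    # near-zero baseline does not fail the canary on a single error.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"
```

The "wait" verdict matters as much as the other two: a canary with too few requests has not passed, it has simply not been tested yet.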

When it fits: Any system with users. Pick the strategy that matches your traffic shape, your data model, and your tolerance for partial-state during deploys.

Feature Flags

Decouple deploy from release; turn features off without redeploying.

Problem

Releasing a feature is risky; rolling it back requires another deploy. Long-lived branches diverge from main and become merge nightmares.

Shape

Wrap new behavior in a flag. Deploy with the flag off. Turn it on for internal users, then a small percentage, then everyone. Rollback is a config change, not a redeploy.
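The staged rollout above can be sketched with deterministic hashing, so a given user stays in the same bucket as the percentage grows. The `Flag` fields and the flag name are hypothetical; a real system would load this configuration from a flag service or config store.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Flag:
    name: str
    enabled: bool = False                  # master switch; False is the rollback
    percent: int = 0                       # share of general users, 0..100
    allow_users: frozenset = frozenset()   # e.g. internal staff


def bucket(flag_name: str, user_id: str) -> int:
    """Map a user to a stable bucket in 0..99 for this flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    if not flag.enabled:
        return False
    if user_id in flag.allow_users:
        return True
    return bucket(flag.name, user_id) < flag.percent
```

Hashing the flag name together with the user ID keeps cohorts independent across flags, so the same users are not always the first to see every new feature.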

Watch for

Flags accumulate. A codebase with hundreds of stale flags is its own kind of debt. Set expiration dates and clean them up.

When it fits: Any non-trivial change to user-facing behavior. Especially valuable for migrations, experiments, and anything that might need to be turned off at 3am.

Chaos Engineering

Break things on purpose to find what would break by accident.

Problem

Failures you haven't seen are failures you don't handle. By the time production teaches you, customers are paying the price.

Shape

State a steady-state hypothesis (the metric you expect to stay flat), then inject controlled failures in production-like environments — kill instances, slow networks, exhaust disks, drop messages. The discipline is the experiment, not the breakage: hypothesis, blast radius, abort criteria, observed result. Fix the surprises.
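The experiment structure (hypothesis, blast radius, abort criteria, observed result) can be written down as data plus a small runner. This is a sketch under stated assumptions: `inject`, `rollback`, and `read_metric` are hypothetical callables you would supply, and nothing here talks to a real fault-injection tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Experiment:
    hypothesis: str
    blast_radius: str
    steady_state: Callable[[float], bool]  # True while the metric looks normal
    abort_if: Callable[[float], bool]      # True when the experiment must stop


def run_experiment(exp: Experiment,
                   inject: Callable[[], None],
                   rollback: Callable[[], None],
                   read_metric: Callable[[], float],
                   samples: int = 5) -> Dict[str, object]:
    # Never inject into a system that is already unhealthy.
    if not exp.steady_state(read_metric()):
        return {"status": "skipped", "observed": []}
    observed: List[float] = []
    inject()
    try:
        for _ in range(samples):
            value = read_metric()
            observed.append(value)
            if exp.abort_if(value):
                return {"status": "aborted", "observed": observed}
        held = all(exp.steady_state(v) for v in observed)
        return {"status": "held" if held else "violated", "observed": observed}
    finally:
        # The failure is always removed, even on abort or error.
        rollback()
```

A "violated" result is the useful one: the hypothesis was wrong, and that is the surprise to fix.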

Watch for

Chaos in production needs strong observability and circuit breakers, or it becomes a self-inflicted outage. Start in staging; graduate to a limited production blast radius only when you're ready.

When it fits: Distributed systems where failure modes are too numerous to enumerate. Mature SRE organizations; high-availability systems where untested failure paths are unacceptable.

Incident Response

Playbooks, on-call rotation, blameless postmortems.

Problem

Outages happen. Without a process, the response is slow, the wrong people get paged, and the same root cause hits twice.

Shape

On-call rotation with clear handoff. Severity tiers (SEV1/2/3) so paging policy matches actual user impact. Playbooks for known scenarios. An incident commander role to coordinate. Track MTTD (mean time to detect), MTTR (mean time to resolve or restore), and MTBF (mean time between failures); together they tell you whether you're getting better at detection, response, or prevention. Hold a blameless postmortem afterward that focuses on systems and process, not people.
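The three metrics can be computed from incident timestamps. Definitions vary between teams, so this sketch states its assumptions: MTTR is measured from start of impact to resolution, and MTBF from one incident's resolution to the next incident's start. The `Incident` record is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Incident:
    started: datetime    # when user impact began
    detected: datetime   # when an alert or a human noticed
    resolved: datetime   # when service was restored


def _mean(deltas: List[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: List[Incident]) -> timedelta:
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: List[Incident]) -> timedelta:
    return _mean([i.resolved - i.started for i in incidents])


def mtbf(incidents: List[Incident]) -> timedelta:
    if len(incidents) < 2:
        raise ValueError("MTBF needs at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean(gaps)
```

Whichever definitions you pick, keep them fixed; the trend over time is what tells you whether detection, response, or prevention is improving.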

Watch for

Postmortems become finger-pointing if a blame culture exists. Action items pile up if no one owns them. The same incident recurring is a process failure, not bad luck.

When it fits: Any system that runs 24/7 with consequences when it fails. The discipline scales from a 2-person team to thousands.

Capacity Planning

Projecting load and provisioning headroom before you need it.

Problem

Capacity surprises are the most expensive kind of surprise. Scaling up takes time, and adding capacity once you are already overloaded rarely works.

Shape

Track current usage and project growth (linear, seasonal, event-driven). Provision for projected peak plus margin. Load-test against the next 12 months of growth, not yesterday's traffic.
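"Projected peak plus margin" is simple arithmetic once the inputs are written down. This sketch assumes compound monthly growth and a flat multiplier for a known event; both are modelling assumptions, as are the parameter names and the 30% default headroom.

```python
import math


def projected_peak(current_peak_rps: float,
                   monthly_growth: float,
                   months: int,
                   event_multiplier: float = 1.0) -> float:
    """Peak load after `months` of compound growth, scaled for a known event."""
    return current_peak_rps * (1.0 + monthly_growth) ** months * event_multiplier


def instances_needed(peak_rps: float,
                     per_instance_rps: float,
                     headroom: float = 0.3) -> int:
    """Instances required to serve the peak with a safety margin on top."""
    return math.ceil(peak_rps * (1.0 + headroom) / per_instance_rps)
```

Starting from a 1,000 rps peak with 5% monthly growth, the 12-month projection is about 1,796 rps; at 200 rps per instance and 30% headroom that is 12 instances. The per-instance figure should come from a load test, not a spec sheet.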

Watch for

Cloud autoscaling makes this feel automatic until it isn't — quotas, instance availability, and database limits don't elastically scale. Know your real ceiling, not the marketing ceiling.

When it fits: Any system serving variable load. Especially critical before launches, marketing campaigns, seasonal peaks, and migrations.

Backpressure & Rate Limiting

Fail predictably under overload instead of failing chaotically.

Problem

When demand exceeds capacity, queues grow unbounded, latency spirals, and the system collapses under its own retries. Without backpressure, overload becomes outage.

Shape

Three layers, related but distinct. Timeouts and deadlines are the foundation: bound every blocking call, propagate deadlines across services, never wait forever. Backpressure is flow control between stages (bounded queues, credit-based demand as in Reactive Streams, TCP windowing), so slow consumers slow producers instead of dropping work invisibly. Rate limiting rejects excess load at the edge. Pair these with retries that use exponential backoff and jitter to avoid retry storms. Circuit breakers and bulkheads sit alongside these to isolate failure; they are not substitutes.
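Two of these pieces are small enough to sketch: a token-bucket rate limiter for the edge, and retry delays with exponential backoff and full jitter. The class and function names are illustrative, and the clock is injected so the behavior can be tested without sleeping.

```python
import random
import time
from typing import Callable, List, Optional


class TokenBucket:
    """Allow bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller should reject, e.g. with HTTP 429


def backoff_delays(attempts: int, base: float = 0.1, cap: float = 10.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Full jitter: each delay is uniform in [0, min(cap, base * 2**attempt)]."""
    rng = rng or random.Random()
    return [rng.uniform(0.0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

The jitter is the part that prevents retry storms: without it, clients that failed together retry together.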

Watch for

Rate limits that are too tight throttle legitimate users. Backpressure that's too aggressive starves valid work. Tune against real traffic, not synthetic load.

When it fits: Any system with shared resources and variable load. Especially valuable for APIs, queues, and any service whose SLA is tighter than its dependencies'.

Observability Strategy

Logs, metrics, and traces designed in, not bolted on.

Problem

You can only debug what you can see. Outages without observability are guesswork. Adding observability after an incident is too late.

Shape

Three pillars. Logs (events with context). Metrics (numbers over time). Traces (request paths across services). Add structured logging from day one, metrics for SLO measurement, and traces for cross-service investigation.

Watch for

Cardinality explosion is the silent killer of metrics systems: keep unbounded values such as user IDs or request IDs out of metric labels. Logs without correlation IDs are searchable but not navigable.
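A minimal sketch of structured logs that carry a correlation ID, using only the Python standard library. The field names and the `build_logger` helper are assumptions for illustration; in a real service the ID would be set from an incoming request header.

```python
import contextvars
import json
import logging
import sys

# Holds the ID of the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, tagged with the correlation ID."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def build_logger(name: str, stream=sys.stderr) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate lines via the root logger
    return logger
```

Every line written while handling a request then shares one ID, which is what makes the logs navigable rather than just searchable.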

When it fits: Every system. The cost of adding observability after an outage is the cost of one more outage with the same root cause.

Health Checks & Probes

Liveness, readiness, and startup checks done right.

Problem

A service that's running isn't necessarily serving. Without checks, load balancers send traffic to broken instances; orchestrators don't restart hung pods.

Shape

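Liveness: am I alive (or should I be killed)? Readiness: am I ready for traffic? Startup: am I done initializing? Each answers a different question and triggers a different action: a failed liveness check restarts the instance, a failed readiness check takes it out of rotation, and a pending startup check holds off the other two until initialization finishes.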

Watch for

A liveness check that calls downstream dependencies will mass-restart healthy instances during a dependency outage, making the outage worse. Keep liveness self-contained; let readiness check dependencies.
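The split can be sketched as three handlers that return an HTTP-style status and message. The `Probes` class and the dependency names are hypothetical; wiring these to real endpoints is left to whatever web framework you use.

```python
from typing import Callable, Dict, Tuple

Response = Tuple[int, str]


class Probes:
    def __init__(self, dependency_checks: Dict[str, Callable[[], bool]]):
        self.dependency_checks = dependency_checks
        self.started = False

    def mark_started(self) -> None:
        self.started = True

    def startup(self) -> Response:
        return (200, "started") if self.started else (503, "starting")

    def liveness(self) -> Response:
        # Deliberately calls no dependencies: answering at all proves the
        # process is not hung, and a restart cannot fix a broken database.
        return 200, "alive"

    def readiness(self) -> Response:
        if not self.started:
            return 503, "starting"
        failing = sorted(name for name, check in self.dependency_checks.items()
                         if not self._ok(check))
        if failing:
            return 503, "not ready: " + ", ".join(failing)
        return 200, "ready"

    @staticmethod
    def _ok(check: Callable[[], bool]) -> bool:
        try:
            return bool(check())
        except Exception:
            return False  # a check that raises counts as failing
```

During a dependency outage this instance reports not ready and stops receiving traffic, but it stays alive and rejoins on its own once the dependency recovers.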

When it fits: Anything running on Kubernetes, ECS, or behind a load balancer. The check design is part of the deployment, not an afterthought.

Graceful Degradation

Keep working when parts fail; partial service beats no service.

Problem

A single dependency failure cascades into a full outage. Users see a broken homepage because the recommendations service is slow.

Shape

Identify which features are essential (must work) vs nice-to-have (degrade if dependency is down). Use timeouts, fallbacks, and cached values to keep the core working — and remember stale data has its own correctness cost (stale prices, stale auth), so bound the staleness deliberately. Show users a degraded experience, not an error page.
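A fallback with bounded staleness might look like the sketch below. It assumes the `fetch` callable enforces its own timeout and raises on failure; the class name, the `max_stale_seconds` parameter, and the source labels are illustrative.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class CachedFallback:
    """Serve fresh data when possible, bounded-stale data when not."""

    def __init__(self, fetch: Callable[[Hashable], Any],
                 max_stale_seconds: float, default: Any,
                 clock: Callable[[], float] = time.monotonic):
        self.fetch = fetch
        self.max_stale_seconds = max_stale_seconds
        self.default = default
        self.clock = clock
        self._cache: Dict[Hashable, Tuple[Any, float]] = {}

    def get(self, key: Hashable) -> Tuple[Any, str]:
        """Return (value, source), where source is 'fresh', 'stale', or 'default'."""
        now = self.clock()
        try:
            value = self.fetch(key)
        except Exception:
            # Broad on purpose: any dependency failure should degrade, not crash.
            cached = self._cache.get(key)
            if cached is not None and now - cached[1] <= self.max_stale_seconds:
                return cached[0], "stale"
            return self.default, "default"
        self._cache[key] = (value, now)
        return value, "fresh"
```

Returning the source alongside the value lets the caller count "stale" and "default" responses as a metric, so the degradation is visible to operators. For data such as prices or auth decisions, a short or zero `max_stale_seconds` may be the only safe choice.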

Watch for

Silent degradation hides real failures. Users may not notice the missing feature, but operators need to. Degradation should be observable and alert-able.

When it fits: Any user-facing system with multiple backend dependencies. Anywhere the cost of a partial outage is much less than a full one.

Reliability is invisible until it isn't

These ten practices are the difference between teams that get called at 3am and teams that get called at 3am once. The investment looks like overhead until you're inside an outage and the playbook is the only reason it stays a small one.

Build the practice before you need it. The teams that do are the ones whose services your customers don't notice — and that's the whole point.