At some point in an SRE loop, an interviewer will ask you to explain what happens after a 5xx spike. Not abstractly. Walk me through it: what you check first, what tool you open, what you do if the first hypothesis is wrong. How you answer that question tells the interviewer more about your operational instincts than almost anything else in the interview.
This is a question bank for that kind of interview. Not trivia. Not definitions. The questions below are the ones that actually distinguish candidates who’ve operated systems under pressure from candidates who’ve read about it.
Fundamentals questions (expect these in every loop)
These come up in phone screens and early rounds. Get them wrong and you don’t move forward.
- What’s the difference between an SLI, SLO, and SLA? SLI is a measurement. SLO is your internal target. SLA is the external promise. The key detail: the SLA should be looser than your SLO (for example, a 99.9% SLA behind a 99.95% internal SLO), or you’ll violate customer contracts before you even know you have a problem.
- What is an error budget? The gap between 100% and your SLO. If your SLO is 99.9%, your monthly error budget is about 43 minutes. The idea is to treat that budget as a shared resource between reliability and feature velocity.
- What is toil? Manual, repetitive, automatable work that scales linearly with service growth. The Google SRE book argues teams should keep toil under 50% of their time. In practice, getting below 40% takes sustained effort. Claiming your team had zero toil will raise eyebrows.
- What’s the difference between monitoring and observability? Monitoring is predefined metrics and alerts. Observability is the ability to infer internal system state from external outputs, typically through a combination of logs, metrics, and distributed traces. A system can be heavily monitored but not observable if you can’t debug novel failures.
- What makes an alert good vs. bad? Good alerts are actionable and urgent. Bad alerts are informational pages at 3 AM for things that can wait until morning. Alert fatigue is a real operational risk.
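The error budget arithmetic above is worth being able to reproduce on a whiteboard. A minimal sketch, assuming a time-based availability SLO and a 30-day month (the function name and defaults are illustrative, not from any particular tool):

```python
def error_budget_minutes(slo_percent: float, window_days: float = 30) -> float:
    """Minutes of full unavailability an SLO allows over a window.

    Assumes a time-based availability SLO; request-based SLOs would
    count failed requests instead of minutes.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


# 99.9% over a 30-day month works out to roughly 43 minutes.
monthly_budget = error_budget_minutes(99.9)
print(f"99.9% over 30 days: {monthly_budget:.1f} minutes")
```

Changing `window_days` to 365 gives the yearly figure for the same target, which is useful when an interviewer switches between monthly and annual numbers.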
System design for reliability
These questions are open-ended and have no single right answer. Interviewers are watching how you reason, not whether you arrive at a specific architecture.
- Design a global rate limiter for an API that handles 500,000 requests per second. Push candidates toward token buckets or sliding window counters, the tradeoff between exact rate limiting and approximate rate limiting with Redis, and what happens to availability if the rate limiter itself goes down.
- How would you design an alerting system that avoids page fatigue? Key ideas: separate “page now” from “ticket for tomorrow,” dynamic thresholds over static ones, alert correlation to group related pages.
- Walk me through designing for 99.99% availability. Note that 99.99% is roughly 52 minutes of allowed downtime per year. Hitting that typically requires multi-region active-active deployments, zero-downtime deploys, and fast, well-tested automated rollback. Ask what the business actually needs before defaulting to the highest number.
- How do you handle cascading failures? Circuit breakers, bulkheads, timeout hierarchies, graceful degradation. The textbook answer is fine here but add a real example if you have one.
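For the rate limiter question, it helps to have the core token bucket logic in your head before discussing distribution. The following is a single-process sketch only; the class name, parameters, and injectable clock are assumptions for illustration. A global limiter at 500,000 requests per second would need shared or sharded state (for example in Redis), which is not shown here.

```python
import time


class TokenBucket:
    """In-memory token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the logic can be tested
        self.tokens = capacity      # start full, which allows an initial burst
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and spend tokens if the request fits, else False."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that passed, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

This version is not thread-safe. In an interview, that is a natural place to bring up locking, per-node approximate limits, and whether the limiter should fail open or fail closed when its backing store is unavailable.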
Incident management questions
This is where candidates who’ve actually been on call separate from candidates who’ve studied for on call.
- Walk me through your incident response process. Detection, severity classification, incident commander assignment, customer communication, mitigation, resolution, postmortem. The postmortem follow-through is where most answers fall flat.
- How do you run a blameless postmortem? The honest answer is that truly blameless postmortems are hard in cultures with high individual accountability. The mechanics: focus on systems and processes, not people. Five whys on the contributing factors. Action items with owners and deadlines. The test: would the engineer involved feel comfortable presenting this to the whole team?
- An on-call alert fires at 2 AM. Your first hypothesis is wrong. What do you do? The key: document your steps even at 2 AM. Don’t thrash randomly. Form a new hypothesis based on what you ruled out. Know your escalation path before you’re on call, not during.
- You have 80% of your error budget burned in the first week of the month. What do you do? Investigate whether the burn represents real user impact or measurement artifacts. Have a conversation with product and engineering leadership about priorities. Don’t unilaterally halt feature work without data and alignment.
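For the error budget question, putting a number on the burn makes the conversation with leadership concrete. A rough sketch, assuming a 30-day window and that the burn continues at its average pace so far (real incidents rarely burn linearly, so treat the projection as a conversation starter); the function names are illustrative:

```python
def burn_rate(budget_consumed: float, days_elapsed: float,
              window_days: float = 30) -> float:
    """Budget consumption relative to an even-spend pace (1.0 means on pace)."""
    return budget_consumed / (days_elapsed / window_days)


def days_until_exhausted(budget_consumed: float, days_elapsed: float) -> float:
    """Days left before the budget is gone if the average daily burn continues."""
    daily_burn = budget_consumed / days_elapsed
    return (1 - budget_consumed) / daily_burn


# The scenario from the question: 80% of the budget gone after 7 days.
print(f"burn rate: {burn_rate(0.8, 7):.2f}x")
print(f"days until exhausted: {days_until_exhausted(0.8, 7):.2f}")
```

A burn rate well above 1.0 with less than two days of budget left is the kind of data that supports a prioritization discussion rather than a unilateral freeze.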
Coding and scripting
SRE coding interviews are different from software engineering interviews. They lean heavily on Python or Go scripting, Linux internals, and debugging real output.
- Write a script that checks whether a service is healthy every 30 seconds and pages if it fails 3 times in a row. Tests basic loop control, HTTP client usage, and state management. In Python, probably a while loop with a consecutive failure counter and a notification call.
- Given this server log snippet, what’s failing and why? They’ll hand you 20 lines of logs. Read them line by line. Identify timestamps, error codes, and correlation IDs. “I’m not sure yet, but the 503 cluster between 03:42:07 and 03:42:19 lines up with the connection pool exhaustion error two lines earlier” is a good start.
- How would you find what process is consuming the most memory on a Linux server? ps aux --sort=-%mem | head -20, or top sorted by memory. Bonus: /proc/<pid>/status for per-process memory detail.
- Explain how you’d debug a service that’s slow intermittently but fine most of the time. Distributed tracing is the right tool here. Identify which percentile the slowness shows up in (p95? p99?). Check for GC pauses, lock contention, or noisy-neighbor effects in shared infrastructure.
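The health check question above can be sketched as follows. This is one possible shape, not a reference answer: the `page` function is a hypothetical stand-in for a real paging integration, the URL in the usage comment is made up, and the `sleep` and `max_checks` parameters exist only so the loop can be tested without waiting.

```python
import time
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return False


def page(message: str) -> None:
    """Hypothetical notifier; a real one would call your paging system."""
    print(f"PAGE: {message}")


def monitor(check, notify, interval: float = 30.0, threshold: int = 3,
            sleep=time.sleep, max_checks=None) -> None:
    """Run check() every `interval` seconds; notify after `threshold` failures in a row."""
    failures = 0
    checks = 0
    while max_checks is None or checks < max_checks:
        checks += 1
        if check():
            failures = 0            # any success resets the streak
        else:
            failures += 1
            if failures == threshold:
                # Fire once per failure streak instead of every interval.
                notify(f"{failures} consecutive health check failures")
        sleep(interval)


# Example usage (runs until interrupted):
# monitor(lambda: is_healthy("http://localhost:8080/healthz"), page)
```

Paging only when the counter first reaches the threshold is a deliberate choice to avoid re-paging every 30 seconds during a long outage. Be ready to discuss the alternative of re-notifying on a slower cadence.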
Capacity planning and performance
- How do you decide how much headroom to build into your capacity plan? Standard guidance: 30 to 50% above expected peak. The nuance is defining “expected peak.” Measure over 4-plus weeks, identify seasonal patterns, model event-driven spikes separately.
- Your database CPU has been running at 70% for three weeks. What do you do? Don’t add hardware as the first move. Profile query patterns, identify slow queries, check index usage. Throwing more CPU at a query with a missing index only buys time until the next spike hits.
- How do you approach load testing before a major launch? Baseline current system behavior under realistic load. Use production traffic shapes, not synthetic uniform load. Test at 1.5x, 2x, and 3x expected peak. Document failure modes and the specific thresholds where degradation starts.
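The headroom and load test numbers above reduce to simple arithmetic, sketched here under stated assumptions: "expected peak" is taken as the highest daily peak in the measurement window (a high percentile is an equally defensible choice), headroom defaults to 40%, and all function names are illustrative.

```python
def expected_peak(daily_peaks, min_days: int = 28) -> float:
    """Highest observed daily peak, refusing windows shorter than about four weeks."""
    if len(daily_peaks) < min_days:
        raise ValueError(f"need at least {min_days} days of data, got {len(daily_peaks)}")
    return max(daily_peaks)


def provisioned_capacity(peak: float, headroom: float = 0.4) -> float:
    """Capacity target: expected peak plus a fractional headroom (0.3 to 0.5 is typical)."""
    return peak * (1 + headroom)


def load_test_targets(peak: float, multipliers=(1.5, 2.0, 3.0)) -> list:
    """Load levels to test at, as multiples of the expected peak."""
    return [peak * m for m in multipliers]


# Hypothetical data: 28 days of daily peak requests per second.
peaks = [800 + (day % 7) * 30 for day in range(28)]
peak = expected_peak(peaks)
print(peak, provisioned_capacity(peak), load_test_targets(peak))
```

Seasonal patterns and event-driven spikes are not modeled here. As the answer above notes, those are better handled as separate scenarios than folded into a single peak number.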
Automation and toil reduction
- Describe a piece of toil you automated. What was the impact? The interviewer wants specifics. Time saved per week, error rate before and after, whether the automation itself created new maintenance burden.
- When is manual intervention better than automation? A fair question with a non-obvious answer. Automation fails in novel situations, where the system behaves in ways the automation was never designed or tested for. Having a human in the loop for major incidents involving data loss or security events is usually the right call.
- How do you manage configuration drift across a fleet of 500 servers? Infrastructure as code (Terraform, Ansible, Chef), immutable infrastructure patterns, and regular convergence runs with alerting when drift is detected.
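The drift detection idea can be illustrated with a toy comparison of rendered config against a desired state. This is only a sketch of the concept, with hypothetical host names and function names. Real tools such as Ansible or Chef compare resource state, not raw file hashes, and collecting the actual config from each host is out of scope here.

```python
import hashlib


def fingerprint(config_text: str) -> str:
    """Stable SHA-256 fingerprint of a rendered config file."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def find_drift(desired: str, actual_by_host: dict) -> list:
    """Return the sorted hosts whose config does not match the desired state."""
    want = fingerprint(desired)
    return sorted(
        host for host, text in actual_by_host.items()
        if fingerprint(text) != want
    )


# Hypothetical fleet snapshot: one host has a hand-edited value.
desired = "max_connections = 200\n"
fleet = {
    "web-001": "max_connections = 200\n",
    "web-002": "max_connections = 500\n",
    "web-003": "max_connections = 200\n",
}
print(find_drift(desired, fleet))
```

The output of a check like this is what feeds the "alert on drift" step. Whether to auto-converge or open a ticket is the policy decision worth discussing in the interview.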
The Stack Overflow Developer Survey 2024 found that reliability engineering and platform engineering continue to rank among the highest-compensating specializations in software. Demand is strong partly because the role is hard to hire for: the combination of operational judgment, systems knowledge, and coding skill is genuinely rare.
According to the Bureau of Labor Statistics, software developer roles including SRE and DevOps specializations are projected to grow 25% through 2032, faster than nearly any other occupation category.
If you want to practice the open-ended incident questions specifically, Craqly’s interview mode generates scenario-based SRE questions and gives structured feedback on your answers. The system design and incident response formats are especially good for practicing the kind of multi-turn reasoning these interviews require.
Which of these categories do you find hardest to prepare for?