SRE Interview Help: Top Questions on Reliability Engineering

A payment pipeline went down at 2 AM on a Saturday. The on-call engineer spotted it via a latency alert, not a customer complaint. That distinction matters a lot in an SRE interview. How you talk about that scenario (what you measured, why you set the alert threshold where you did) is often what separates a hire from a rejection.

SRE interviews test a specific kind of thinking. They’re not purely systems design, and they’re not purely coding. They sit at the intersection of operational judgment and engineering discipline. I’ve seen engineers with strong distributed systems backgrounds stumble because they couldn’t articulate the difference between an SLI and an SLO. I’ve also seen candidates with modest system design skills land senior SRE roles because their incident reasoning was sharp.

Here are the questions that actually come up, with what interviewers are listening for.

SLOs, SLIs, and the SLA question everyone botches

Interviewers almost always start here. The definitions are easy enough: a Service Level Indicator (SLI) is a measured metric (latency, error rate, availability), a Service Level Objective (SLO) is the target you set internally, and a Service Level Agreement (SLA) is the external contract with consequences.

What trips people up is the relationship between them. A good answer sounds like: “Our SLI is the proportion of requests that succeed, measured over a rolling 28-day window. Our SLO is 99.9%, which gives us roughly 40 minutes of error budget per window (about 43 minutes over a 30-day month). The SLA is what we promise paying customers, usually set 10 to 20 basis points below the SLO so we have room to react before violating a contract.”

The follow-up is usually: “How do you decide what availability number to put in the SLO?” The honest answer is that 99.99% sounds better than 99.9%, but 99.99% allows only about 52 minutes of downtime per year, against roughly 8.8 hours for 99.9%. Most teams can’t sustain four nines without massive toil. Start slightly below what you’re currently achieving and tighten it as your systems improve.
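If you want to sanity-check downtime figures like these before an interview, the arithmetic is short. Here is a minimal Python sketch; the function name is mine, and it assumes the simplest model, where the SLO is time-based availability and the whole budget is spent as full downtime.

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Minutes of full downtime an availability SLO allows over a window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    # The error budget is whatever the SLO does not promise.
    return window_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    for slo, days in [(99.9, 30), (99.9, 28), (99.99, 365)]:
        minutes = error_budget_minutes(slo, days)
        print(f"{slo}% over {days} days: {minutes:.1f} min")
```

This gives about 43 minutes for 99.9% over a 30-day month, about 40 minutes over a 28-day window, and about 52 minutes per year for 99.99%. A request-based SLO would count failed requests instead of minutes, but the budget logic is the same.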

Error budgets in practice

This is where SRE interviews get interesting. The theory is clean: if your SLO is 99.9%, you have a 0.1% error budget. Burn it too fast and you stop shipping features until reliability recovers. In practice, most organizations don’t actually enforce this policy. Interviewers know that. They’re testing whether you know it too.

Good candidates say something like: “The error budget framework is useful even if the enforcement is soft. It gives the team a shared language for the reliability-vs-velocity tradeoff. When we’re burning budget fast, we can point to the number instead of having a subjective argument about whether to slow down.”

A harder question: “Your team burned 80% of its monthly error budget in the first week. What do you do?” The answer isn’t “declare a freeze on feature work.” It’s an investigation into whether the budget burn reflects real user impact or measurement noise, followed by a conversation with product leadership about priorities. Unilaterally halting feature releases based on a metric is a good way to lose stakeholder trust fast.
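It helps to put a number on how bad "80% in the first week" is before you walk into that conversation. The sketch below is one way to do it, assuming a 30-day budget window and a burn that continues at the same average pace; both function names are hypothetical, and a plain linear projection like this says nothing about whether the burn reflects real user impact.

```python
def burn_rate(budget_fraction_used: float, days_elapsed: float,
              window_days: float = 30) -> float:
    """How many times faster than an exactly-on-budget pace we are burning."""
    if days_elapsed <= 0 or window_days <= 0:
        raise ValueError("days_elapsed and window_days must be positive")
    expected_fraction = days_elapsed / window_days
    return budget_fraction_used / expected_fraction


def days_until_exhausted(budget_fraction_used: float, days_elapsed: float) -> float:
    """Days left before the budget hits zero if the current pace continues."""
    if days_elapsed <= 0 or budget_fraction_used <= 0:
        raise ValueError("need positive elapsed time and a non-zero burn")
    daily_burn = budget_fraction_used / days_elapsed
    return (1 - budget_fraction_used) / daily_burn


if __name__ == "__main__":
    print(f"burn rate: {burn_rate(0.8, 7):.1f}x")
    print(f"days left at this pace: {days_until_exhausted(0.8, 7):.2f}")
```

For the scenario in the question, that works out to a burn rate of about 3.4x and under two days of budget left at the current pace. Those two numbers are what you bring to product leadership, alongside the evidence on user impact.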

Incident response: what the timeline reveals about your thinking

A typical prompt: “Walk me through how you handled a significant incident.” Interviewers aren’t looking for a war story. They’re looking for a structured approach to detection, triage, mitigation, and postmortem.

What makes an answer land well:

  • Detection: how did you know something was wrong? Was it an alert, a customer report, or something you noticed while on-call? Alerts are better than customer reports.
  • Triage: how did you narrow the blast radius? What signals told you it was the database and not the application layer?
  • Mitigation: what did you do to stop the bleeding before root cause was identified? Rollback, feature flag, reroute traffic?
  • Postmortem: what actually changed after the incident? “We did a blameless postmortem” is table stakes. What specific system or process was different afterward?

The postmortem question is the one I think most candidates answer too abstractly. Saying “we improved our monitoring” is not an answer. “We added a synthetic transaction test that runs every 90 seconds against the checkout flow, which would have caught this 14 minutes earlier than our latency alert did” is an answer.
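To make the synthetic transaction idea concrete, here is a rough Python sketch of the scheduling part only. Everything in it is an assumption for illustration: `check` stands in for whatever exercises your checkout flow, and the "alert after two consecutive failures" rule is my own choice, not something from the incident described above.

```python
import time
from typing import Callable, Optional


def run_synthetic_checks(
    check: Callable[[], bool],
    interval_seconds: float = 90,
    failures_to_alert: int = 2,
    max_runs: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Run `check` on a fixed interval; return the run number that should alert.

    Returns None if max_runs is reached without enough consecutive failures.
    """
    consecutive_failures = 0
    run = 0
    while max_runs is None or run < max_runs:
        run += 1
        try:
            ok = check()
        except Exception:
            ok = False  # a crashed probe counts as a failed transaction
        consecutive_failures = 0 if ok else consecutive_failures + 1
        if consecutive_failures >= failures_to_alert:
            return run
        sleep(interval_seconds)
    return None


if __name__ == "__main__":
    # Hypothetical probe results: two passes, then the flow starts failing.
    results = iter([True, True, False, False])
    alert_run = run_synthetic_checks(
        lambda: next(results), max_runs=4, sleep=lambda _seconds: None
    )
    print(f"alert on run {alert_run}")
```

Passing `sleep` in as a parameter lets you test the loop without waiting 90 seconds per run. In a real setup the probe would more likely live in your monitoring platform than in a hand-rolled loop, but being able to describe the interval and the alert condition this precisely is what makes the postmortem answer credible.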

Monitoring vs. observability

You will get this question. Monitoring is about collecting and alerting on predefined metrics. Observability is about building systems whose internal state you can infer from external outputs, typically logs, metrics, and traces together.

The subtext of this question is usually: “Can you design alert systems that don’t page engineers at 3 AM for things that don’t matter?” Alert fatigue is a real problem. The Stack Overflow Developer Survey 2024 found that on-call burnout consistently ranks among the top reasons experienced engineers leave their roles.

A practical answer: alerts should be actionable. If an engineer gets paged, there should be a defined next step they can take. “Disk usage is at 72%” is not actionable at 3 AM. “Disk usage will reach 100% in approximately 4 hours based on current write rate” is actionable and urgent.
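The disk example is easy to sketch in code. This is a minimal version, assuming a straight-line extrapolation from the first and last usage samples; the function names and the 4-hour paging threshold are illustrative, and a production alert would normally lean on the monitoring system's own prediction functions.

```python
from typing import List, Optional, Tuple

# Each sample is (hours since some reference time, bytes used).
Sample = Tuple[float, float]


def hours_until_full(samples: List[Sample], capacity_bytes: float) -> Optional[float]:
    """Estimate hours until the disk is full from the average write rate.

    Returns None when usage is flat or shrinking, so there is nothing to predict.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    rate = (used1 - used0) / (t1 - t0)  # bytes per hour
    if rate <= 0:
        return None
    return (capacity_bytes - used1) / rate


def should_page(samples: List[Sample], capacity_bytes: float,
                page_within_hours: float = 4.0) -> bool:
    """Page only when the projected time to full is inside the threshold."""
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta <= page_within_hours


if __name__ == "__main__":
    capacity = 1000e9  # a 1 TB disk
    slow = [(0, 700e9), (1, 720e9)]  # 72% used, growing 20 GB/hour
    fast = [(0, 580e9), (1, 720e9)]  # 72% used, growing 140 GB/hour
    print(should_page(slow, capacity), should_page(fast, capacity))
```

Both disks sit at 72%, but only the fast-growing one pages: it is projected to fill in 2 hours, while the slow one has 14. That is the difference between a threshold alert and an actionable one.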

Toil: the 50% rule and why it’s harder than it sounds

Google’s SRE book argues that engineers should spend no more than 50% of their time on toil. Toil is manual, repetitive, automatable work that scales with service growth but creates no lasting value. Running the same deployment script by hand every week is toil. Writing a deployment pipeline that does it for you is not.

The 50% figure is aspirational for most teams. I don’t think I’ve worked on or spoken to a team that’s consistently below 50% toil without significant investment in internal tooling. That’s worth saying in an interview, actually. Candidates who claim their previous team had zero toil raise skepticism.

A better answer: “We tracked toil explicitly in our sprint planning and tried to allocate one or two toil-reduction tickets per sprint. We got from about 60% to 40% over six months. We never got to 20%, partly because the business kept adding new services and each new service came with its own operational overhead before automation caught up.”
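If you want to back an answer like that with numbers, you need some way to measure toil in the first place. One rough approach, assuming you tag logged hours by category, is sketched below; the category names are made up for the example.

```python
from typing import Dict, Set


def toil_fraction(hours_by_category: Dict[str, float],
                  toil_categories: Set[str]) -> float:
    """Share of logged hours that went to categories tagged as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for category, hours in hours_by_category.items()
               if category in toil_categories)
    return toil / total


if __name__ == "__main__":
    # Hypothetical sprint: 40 logged hours for one engineer.
    sprint = {"manual deploys": 12, "ticket triage": 12, "project work": 16}
    share = toil_fraction(sprint, {"manual deploys", "ticket triage"})
    print(f"toil share: {share:.0%}")
```

The hard part is agreeing on which categories count as toil, not the division. Tracking the same categories sprint over sprint is what lets you claim a trend like 60% down to 40%.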

Capacity planning without the obvious answer

The standard guidance is to maintain 30 to 50% headroom above expected peak load. That’s correct. The more interesting question is how you decide what “expected peak load” means.

Traffic patterns in production are rarely stable. A B2B SaaS product might have predictable weekday peaks and near-zero weekend traffic. A consumer app tied to a TV schedule might need to absorb 40x normal traffic in a 4-minute window. The right answer to “how do you capacity plan?” is not a formula. It’s a process: measure over at least 4 weeks, identify seasonal and event-driven patterns, load test against peak estimates with a defined multiplier, and build autoscaling with a tested upper bound.
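The sizing step of that process can be sketched in a few lines. This is a simplified illustration, not a formula to quote: it assumes you have one observed peak per week in requests per second, and `event_multiplier` and `headroom` are hypothetical knobs you would set from your own load tests and risk tolerance.

```python
from typing import Sequence


def required_capacity(weekly_peaks_rps: Sequence[float],
                      event_multiplier: float = 1.0,
                      headroom: float = 0.3,
                      min_weeks: int = 4) -> float:
    """Capacity target: highest observed peak, scaled for events, plus headroom."""
    if len(weekly_peaks_rps) < min_weeks:
        raise ValueError(f"need at least {min_weeks} weeks of peak data")
    if event_multiplier < 1 or headroom < 0:
        raise ValueError("event_multiplier must be >= 1 and headroom >= 0")
    observed_peak = max(weekly_peaks_rps)
    expected_peak = observed_peak * event_multiplier
    return expected_peak * (1 + headroom)


if __name__ == "__main__":
    peaks = [1200, 1350, 1280, 1500]  # hypothetical weekly peaks, in rps
    print(f"steady state: {required_capacity(peaks):.0f} rps")
    print(f"with a 40x event: {required_capacity(peaks, event_multiplier=40):.0f} rps")
```

The output of something like this is a load-test target and an autoscaling upper bound to verify, not a provisioning order. It also makes the cost conversation concrete, because every point of headroom has a price.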

Interviewers also ask about cost. Infinite capacity headroom is not a real strategy. The tradeoff between over-provisioning for reliability and under-provisioning for cost efficiency is an engineering decision, not just an ops decision. Candidates who talk about cost get credit for understanding the full picture.

On-call design and sustainable rotations

This comes up in senior interviews more than junior ones. A rotation with fewer than 5 engineers means each person is on call more often than one week in five (assuming week-long shifts), which most people find exhausting. But small teams don’t always have the headcount to get to 5.
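The rotation math is simple enough to do in your head, but here is a small sketch for comparing team sizes. It assumes a plain round-robin with equal, back-to-back shifts and no secondary on-call; the function names are mine.

```python
def weeks_between_shifts(team_size: int, shift_weeks: int = 1) -> int:
    """Weeks from the start of one shift to the start of your next one."""
    if team_size < 1 or shift_weeks < 1:
        raise ValueError("team_size and shift_weeks must be at least 1")
    return team_size * shift_weeks


def on_call_weeks_per_year(team_size: int) -> float:
    """Weeks per year each engineer spends on call in an even rotation."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return 52 / team_size


if __name__ == "__main__":
    for size in (3, 4, 5, 8):
        print(f"{size} engineers: on call every {weeks_between_shifts(size)} weeks, "
              f"{on_call_weeks_per_year(size):.1f} weeks per year")
```

With 4 engineers, each person carries the pager for 13 weeks a year; with 5 it drops to about 10. That gap is the headcount argument in one line.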

The real question interviewers are probing is: “Do you understand that on-call sustainability affects hiring, retention, and system quality?” Engineers who burn out on-call leave. When they leave, the rotation shrinks. When the rotation shrinks, the remaining engineers get paged more. The cycle compounds. Designing systems that reduce unnecessary pages is not just an operational concern; it’s a retention strategy. The BLS Occupational Outlook for software roles documents strong demand for reliability engineers, partly because turnover in ops-heavy roles runs high.

If you’re preparing for a senior SRE interview, Craqly’s AI interview mode can run you through multi-part incident scenario questions and give you real-time feedback on whether your answers hit the right frameworks. The structured feedback on your incident reasoning is more useful than flashcards for this kind of interview.

What’s the one question you consistently get wrong when you prep? That’s usually the one worth spending another hour on.
