Most MLOps interview questions aren’t about algorithms. They’re about what happens when your model works fine in training and then quietly degrades in production over three months without a single error log. That’s the problem MLOps was built to solve, and companies hiring for it want to know if you’ve actually dealt with it.
This guide covers the questions that show up consistently, the framing interviewers are looking for, and a few places where I think the standard prep advice gets it wrong.
What interviewers are actually evaluating
There’s a common assumption that MLOps interviews are DevOps interviews with a few ML-flavored questions added in. That’s not really accurate. The core challenge in MLOps is that the “code” (your model) has hidden state that evolves with the data it’s trained on. Standard software CI/CD doesn’t have a good answer for “what if the logic changes because the world changed?” MLOps does.
Good interviewers are looking for whether you understand that distinction. They want to see that you’ve thought about data as a first-class concern, not just as an input to your pipeline. That shapes how they’ll evaluate almost every answer you give.
Pipeline automation questions
Q: How do you decide when to retrain a model?
The weak answer: “When accuracy drops.” The stronger answer distinguishes between schedule-based retraining (retrain every N days regardless), performance-based retraining (retrain when a metric crosses a threshold), and trigger-based retraining (retrain when upstream data distribution shifts beyond a defined bound, even before performance degrades visibly). In practice, the best systems combine all three, with data drift as the earliest warning signal.
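As a minimal sketch of how the three triggers might be combined, the function below checks drift first, then live performance, then model age. The names RetrainPolicy and retrain_reason and all threshold values are assumptions made for illustration, not recommendations.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class RetrainPolicy:
    # All three bounds are illustrative assumptions; tune them per model.
    max_age: timedelta = timedelta(days=30)  # schedule-based bound
    min_metric: float = 0.80                 # performance floor (higher is better)
    max_drift: float = 0.2                   # bound on an upstream drift score


def retrain_reason(
    policy: RetrainPolicy,
    last_trained: datetime,
    now: datetime,
    live_metric: Optional[float],
    drift_score: float,
) -> Optional[str]:
    """Return the reason to retrain, or None if no trigger fired."""
    # Drift is checked first because it can fire before labels arrive.
    if drift_score > policy.max_drift:
        return "drift"
    # Live metrics may be unavailable while ground-truth labels are delayed.
    if live_metric is not None and live_metric < policy.min_metric:
        return "performance"
    if now - last_trained > policy.max_age:
        return "schedule"
    return None
```

Returning the reason rather than a bare boolean makes it easier to log which trigger fires most often, which is itself a useful signal about the model.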
Q: What’s the difference between batch and streaming inference pipelines and when would you choose each?
Batch: you process accumulated data periodically (hourly, daily). Lower operational complexity, higher latency. Good for recommendation refreshes, overnight scoring runs. Streaming: you process each event as it arrives. Higher complexity, near-real-time outputs. Good for fraud detection, real-time personalization. The real answer includes the cost and complexity of managing stateful streaming infrastructure, which often makes batch pipelines the right default until the latency requirement is proven.
Q: How would you validate new training data before it enters a pipeline?
Describe schema validation (correct types, expected ranges, no unexpected nulls), distribution checks against a baseline (are the feature distributions within expected bounds?), and referential integrity if the data joins other sources. Tools like Great Expectations or TensorFlow Data Validation (the data validation library in TFX) are worth mentioning if you’ve used them.
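To make the first two checks concrete, here is a rough plain-Python sketch. It does not use the Great Expectations or TFX APIs. The SCHEMA contents, column names, and the three-standard-deviation bound are hypothetical, and the mean comparison is a deliberately crude stand-in for a real distribution test.

```python
from statistics import mean

# Hypothetical schema: column -> (expected type, min, max).
SCHEMA = {
    "amount": (float, 0.0, 10_000.0),
    "age": (int, 18, 120),
}


def validate_rows(rows, schema=SCHEMA):
    """Return a list of human-readable schema violations (empty if clean)."""
    errors = []
    for i, row in enumerate(rows):
        for col, (typ, lo, hi) in schema.items():
            value = row.get(col)
            if value is None:
                errors.append(f"row {i}: {col} is null or missing")
            elif not isinstance(value, typ):
                errors.append(f"row {i}: {col} has type {type(value).__name__}")
            elif not lo <= value <= hi:
                errors.append(f"row {i}: {col}={value} outside [{lo}, {hi}]")
    return errors


def mean_shift_ok(rows, col, baseline_mean, baseline_std, max_z=3.0):
    """Crude distribution check: batch mean within max_z baseline std devs."""
    values = [r[col] for r in rows if r.get(col) is not None]
    if not values:
        return False  # an empty batch should not pass silently
    return abs(mean(values) - baseline_mean) <= max_z * baseline_std
```

A pipeline would typically fail the run when validate_rows returns anything, and treat a failed mean_shift_ok as a warning that needs a human look before training proceeds.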
Model deployment and serving
Q: Walk me through a blue-green deployment for a model update.
You run two identical serving environments, one live (blue) and one idle (green). You deploy the new model to green, run validation tests, then shift traffic to green. Blue stays up as a rollback target. The MLOps-specific wrinkle: you need to validate model behavior on live traffic, not just static test sets, before committing. A/B testing or canary release (gradually shifting traffic from 1% to 5% to 50%) gives you that validation.
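The gradual traffic shift can be sketched with deterministic hash-based routing. This is an illustration under assumptions: the bucket and route functions are hypothetical names, and real deployments usually do this in the load balancer or service mesh rather than in application code.

```python
import hashlib


def bucket(request_id: str) -> float:
    """Map a request ID to a stable value in [0, 100)."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 10_000 / 100.0


def route(request_id: str, canary_percent: float) -> str:
    """Send roughly canary_percent of requests to green, the rest to blue."""
    return "green" if bucket(request_id) < canary_percent else "blue"


# Ramp schedule from the text: 1% -> 5% -> 50%, then full cutover.
for percent in (1, 5, 50, 100):
    sample = [route(f"req-{i}", percent) for i in range(1000)]
    print(percent, sample.count("green"))
```

Because the bucket depends only on the request ID, a request that lands on green at 5% stays on green at 50%, which keeps behavior consistent for a given caller as the ramp widens. Rolling back is a matter of setting the percentage to 0.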
Q: What’s shadow mode deployment and when would you use it?
Shadow mode runs the new model alongside the production model, making predictions on the same requests but not serving them to users. You accumulate real-world prediction data and compare. Use it when you can’t afford to expose even 1% of live traffic to an unproven model and need high confidence before any traffic shifts. The tradeoff: it doubles your inference cost and you can’t observe real user responses to the shadow model’s outputs.
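A minimal sketch of the pattern, assuming both models are plain callables and the log is an in-memory list. The names serve_with_shadow and disagreement_rate are hypothetical, and a production system would normally run the shadow call asynchronously so it adds no user-facing latency.

```python
def serve_with_shadow(features, prod_model, shadow_model, shadow_log):
    """Return the production prediction; record the shadow one for comparison."""
    prod_pred = prod_model(features)
    try:
        shadow_pred = shadow_model(features)
        shadow_log.append(
            {"features": features, "prod": prod_pred, "shadow": shadow_pred}
        )
    except Exception as exc:
        # A shadow failure must never affect the user-facing response.
        shadow_log.append(
            {"features": features, "prod": prod_pred, "shadow_error": repr(exc)}
        )
    return prod_pred


def disagreement_rate(shadow_log):
    """Share of logged requests where the two models disagreed."""
    compared = [entry for entry in shadow_log if "shadow" in entry]
    if not compared:
        return 0.0
    return sum(e["prod"] != e["shadow"] for e in compared) / len(compared)
```

The disagreement rate only says how often the models differ, not which one is right. Deciding that still needs labels or manual review of the disagreeing cases.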
Q: How do you handle model versioning across data pipelines?
At minimum: version your model artifacts with a hash tied to the training data version and the code version that produced them. When something goes wrong in production, you need to be able to reproduce exactly what model was live at a given time. MLflow, DVC, and similar tools make this tractable. Without versioning, debugging production incidents becomes archaeology.
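One way to tie those three things together is to derive a version ID from a manifest. The sketch below uses only the standard library and does not reflect how MLflow or DVC store versions. The function name model_version and the manifest field names are assumptions.

```python
import hashlib
import json


def model_version(artifact_bytes: bytes, data_version: str, code_version: str) -> dict:
    """Build a manifest whose ID changes if the artifact, data, or code changes."""
    manifest = {
        "artifact_sha256": hashlib.sha256(artifact_bytes).hexdigest(),
        "data_version": data_version,   # e.g. a dataset snapshot tag
        "code_version": code_version,   # e.g. a git commit hash
    }
    # Sort keys so the same inputs always serialize to the same bytes.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["version_id"] = hashlib.sha256(canonical).hexdigest()[:12]
    return manifest
```

Storing the manifest next to each deployment record is what lets you answer "what exactly was live at that time" during an incident.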
Drift detection: the questions that separate candidates
Q: What’s the difference between data drift, concept drift, and prediction drift?
Data drift: the distribution of input features changes. For example, a model trained on 2022 transaction data starts seeing 2025 transaction patterns. Concept drift: the underlying relationship between features and labels changes. For example, a fraud detection model was built before a new fraud technique appeared. Prediction drift: the model’s output distribution shifts, which may or may not reflect a real problem. Prediction drift is often the first thing you can monitor, but it’s a symptom, not a root cause.
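Data drift and prediction drift can both be quantified by comparing a current sample against a baseline sample. One common choice is the population stability index (PSI), sketched below for a single numeric column. The equal-width binning, the bin count, and the epsilon floor are implementation assumptions. Concept drift cannot be measured this way because it needs labels.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index of `actual` relative to `expected`."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the baseline range fall into the edge bins.
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions rather than guarantees, so calibrate them against your own history.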
Q: How would you set up monitoring for a recommendation model?
This is a design question. Strong answers cover: business metrics (click-through rate, conversion) as lagging indicators, prediction distribution as a leading indicator, feature store health checks as an upstream signal, and offline evaluation on a held-out test set updated on a regular cadence. Most candidates cover one or two of these. Covering all four and explaining which you’d alert on vs. log-only is what lands.
CI/CD for ML: what it actually involves
Q: How does CI/CD for ML differ from standard software CI/CD?
Standard CI/CD tests code behavior, which is deterministic. ML CI/CD also needs to test model quality, which is probabilistic and data-dependent. A model that passes unit tests can still degrade on new data. So ML pipelines typically add a model evaluation gate: the new model artifact has to exceed a baseline on a held-out validation set before it can proceed. Some teams also add business metric gates, requiring that a staging canary meets a conversion or engagement threshold before full rollout.
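A model evaluation gate can be as small as a function the pipeline calls before promoting an artifact. This sketch assumes every metric is higher-is-better and that both models were scored on the same held-out set. The function name, metric names, and tolerances are hypothetical.

```python
def evaluation_gate(candidate, baseline, primary="auc", min_gain=0.0, max_regression=0.01):
    """Return a list of gate failures; an empty list means the model may proceed."""
    failures = []
    # The candidate must strictly exceed the baseline on the primary metric.
    if not candidate[primary] - baseline[primary] > min_gain:
        failures.append(f"{primary} does not exceed baseline")
    # Secondary metrics may not regress by more than max_regression.
    for name, base_value in baseline.items():
        if name == primary:
            continue
        if base_value - candidate.get(name, float("-inf")) > max_regression:
            failures.append(f"{name} regressed")
    return failures


failures = evaluation_gate(
    {"auc": 0.92, "recall": 0.70},
    {"auc": 0.90, "recall": 0.70},
)
if failures:
    raise SystemExit("gate failed: " + ", ".join(failures))
```

In a CI job, a non-empty result would fail the step, which is what stops the artifact from reaching staging.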
Q: A model passes all automated quality gates but business metrics drop 72 hours after deployment. How do you debug it?
This is an open-ended question and there’s no single right answer. What they’re looking for: a structured debugging approach. Start by confirming whether the business metric drop is correlated with the model deployment or coincident with another change (new app version, seasonal traffic shift, A/B test running in parallel). If correlated, compare prediction distributions before and after deployment. Check whether the evaluation set was truly representative of current production traffic. Look at tail cases, not just average-case performance. Consider rollback while you investigate.
Feature stores and data management
Q: What problem does a feature store solve?
The training-serving skew problem. Your training pipeline computes features one way. Your serving pipeline computes them another way. Subtle differences accumulate and your model sees different inputs in production than it trained on. A feature store centralizes feature computation so both pipelines use the same logic. Secondary benefits: feature reuse across models, point-in-time correct joins for training (you can reconstruct what features looked like at the time of a historical event).
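The point-in-time idea is easiest to see in code: for each historical event, look up the latest feature value recorded at or before the event time, never a later one. The sketch below assumes a per-entity history already sorted by timestamp. The function name and the data layout are hypothetical and not taken from any particular feature store.

```python
from bisect import bisect_right


def point_in_time_value(history, event_time):
    """Latest feature value at or before event_time, or None if none exists.

    history: list of (timestamp, value) pairs sorted by timestamp.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, event_time)
    return history[idx - 1][1] if idx else None


# Feature values recorded at t=10, 20, 30 for one entity.
history = [(10, 1.0), (20, 2.0), (30, 3.0)]
# A label event at t=25 must see the t=20 value, not the t=30 one.
print(point_in_time_value(history, 25))
```

Joining on the latest value regardless of time would leak future information into training, which is one of the quieter ways to end up with a model that looks better offline than it performs in production.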
How to prep for the open-ended design questions
The Stack Overflow Developer Survey 2024 found that ML and data engineering are among the fastest-growing specializations in terms of salary and job postings, which tracks with how rigorously companies are now interviewing for MLOps roles specifically. The bar has gone up since 2021.
The BLS projects computer and information research scientist employment to grow 26% from 2023 to 2033, much faster than the average for all occupations. MLOps sits at the intersection of that demand.
For design questions, practice giving a structured answer that covers: scale requirements, failure modes, monitoring strategy, and what you’d cut in an MVP vs. what you’d add in v2. The SCALABLE acronym that floats around MLOps prep communities is fine as a mnemonic, but interviewers can tell when you’re working through a template vs. when you’ve actually thought about the system.
If you want to run through MLOps system design questions with a mock interviewer that gives real-time feedback on your structure and coverage, Craqly’s AI interview copilot supports technical design practice across engineering disciplines including ML systems.
What’s the MLOps concept you feel least solid on going into your interview?