Data Science Interview Help: ML, Statistics, and Python Questions

Hello! I want to talk about the data science interview questions that actually show up in technical screens, because a lot of prep material covers the easy ones and glosses over the parts that trip people up in practice.

This is about statistics, ML concepts, and Python questions. Not case studies or behavioral questions (those are different). Just the technical content that shows up in the first or second round for most data science roles.

1. statistics questions

Statistics questions are the ones where I see the most confident-sounding wrong answers. Here are the ones worth really understanding, not just memorizing.

The p-value question. Interviewers ask “what is a p-value?” expecting the textbook answer but also wanting to see whether you understand the common misinterpretation. The textbook answer: assuming the null hypothesis is true, the p-value is the probability of observing a result at least as extreme as what you saw. The common wrong interpretation: “there’s a p% chance the null hypothesis is true.” Those are not the same thing. If you can explain that distinction clearly, in plain English, you’re ahead of most candidates.
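
If it helps to make the definition concrete, here is a minimal sketch of a permutation test, which computes a p-value directly from "how often would shuffled data look at least this extreme?" The function name, the choice of test statistic (absolute difference in means), and the defaults are my own assumptions for illustration, not a standard API.

```python
import numpy as np

def permutation_p_value(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Under the null, group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(pooled[:len(a)].mean() - pooled[len(a):].mean())
        hits += int(diff >= observed)
    # The +1 counts the observed labeling itself as one permutation.
    return (hits + 1) / (n_perm + 1)

control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
treatment = [12.6, 12.9, 12.5, 13.0, 12.7, 12.8]
print(permutation_p_value(control, treatment))
```

Notice what the number is: the fraction of shuffles, generated assuming the null is true, that look at least as extreme as the real data. Nothing in the code computes the probability that the null itself is true.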

Type I and Type II errors. Type I is a false positive (rejecting a true null hypothesis). Type II is a false negative (failing to reject a false null hypothesis). The follow-up question — “when would you prefer to minimize Type II errors over Type I errors?” — is a judgment question. Medical screening for a serious disease is the classic example: missing a real case (Type II) is worse than a false alarm (Type I) in that context. This kind of applied reasoning is what interviewers are looking for.

Central limit theorem. Know what it says: for independent samples from a population with finite variance, the distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the underlying population distribution. Know why it matters: it’s why many statistical tests work even when your data isn’t normally distributed, as long as you have enough samples. What counts as “enough” is vague (often cited as 30, which is itself a rough rule of thumb, and heavily skewed data can need far more), and admitting that is fine.
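
A quick simulation makes this easier to explain out loud. The sketch below draws many samples from a strongly skewed exponential distribution and measures how skewed the sample means are. The function name and defaults are assumptions of mine; skewness is just one convenient way to measure distance from a symmetric bell shape.

```python
import numpy as np

def sample_mean_skewness(sample_size, n_samples=20_000, seed=0):
    """Skewness of the sample mean for exponential(1) data."""
    rng = np.random.default_rng(seed)
    draws = rng.exponential(scale=1.0, size=(n_samples, sample_size))
    means = draws.mean(axis=1)
    centered = means - means.mean()
    # Third standardized moment: 0 for a symmetric distribution.
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)

for n in (1, 5, 30, 200):
    print(n, round(sample_mean_skewness(n), 3))
```

For this particular distribution the theoretical skewness of the mean is 2 / sqrt(n), so it shrinks slowly: at n = 30 the means are still visibly skewed, which is a good way to back up the point that 30 is only a rule of thumb.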

A/B test design questions. “How would you design an A/B test for this feature?” is a common question, and the strong answer mentions: statistical power, sample size calculation, choosing the right metric, the risk of peeking (checking results repeatedly and stopping as soon as they look significant, which inflates the false positive rate), and how you’d handle network effects if users can interact with each other. Most candidates cover the first two and miss the last three.
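
For the sample size part, it helps to have the arithmetic in your head. This is a rough sketch using the common normal-approximation formula for comparing two conversion rates with a two-sided test. The function name and the default alpha and power are assumptions for illustration, and real experimentation platforms may use a different formula.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Detecting a lift from 10% to 12% conversion
print(sample_size_per_arm(0.10, 0.12))
# Halving the lift needs far more users
print(sample_size_per_arm(0.10, 0.11))
```

The thing worth saying in the interview is the shape of the relationship: required sample size grows with the inverse square of the effect you want to detect, so small lifts get expensive quickly.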

2. machine learning fundamentals

The bias-variance tradeoff is the single most common ML interview question I’ve encountered. The core: a high-bias model is too simple and underfits (it has systematic error regardless of training data). A high-variance model is too complex and overfits (it performs well on training data but poorly on new data). The tradeoff is that reducing one often increases the other. Regularization (L1, L2) reduces variance at the cost of some bias. That’s the framework most interviewers want to hear.
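
One way to see the tradeoff is to fit polynomials of increasing degree to the same noisy data and compare training error with error on fresh data. This is a toy sketch: the sine-shaped ground truth, the noise level, and the function name are all assumptions I picked for illustration.

```python
import numpy as np

def train_test_mse(degree, n_train=30, n_test=200, noise=0.3, seed=0):
    """Fit a polynomial to noisy sine data; return (train MSE, test MSE)."""
    rng = np.random.default_rng(seed)  # same seed gives the same data for every degree

    def sample(n):
        x = rng.uniform(0.0, 1.0, n)
        y = np.sin(2 * np.pi * x) + rng.normal(0.0, noise, n)
        return x, y

    x_train, y_train = sample(n_train)
    x_test, y_test = sample(n_test)
    coeffs = np.polyfit(x_train, y_train, degree)  # ordinary least squares fit

    def mse(x, y):
        return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

    return mse(x_train, y_train), mse(x_test, y_test)

for degree in (1, 3, 9):
    train_err, test_err = train_test_mse(degree)
    print(f"degree={degree}: train MSE={train_err:.3f}, test MSE={test_err:.3f}")
```

Training error can only go down as the degree increases, because each higher-degree model contains the lower-degree ones. The test column is the one to watch: a straight line underfits a sine wave, and the gap between train and test error is the usual signal of variance.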

Random forests vs. gradient boosting. Both are ensemble methods but they work differently. Random forests train trees independently on bootstrap samples (bagging), usually considering a random subset of features at each split, and average the trees’ predictions. Gradient boosting trains trees sequentially, with each tree fit to the errors left by the ensemble so far. In practice, gradient boosting (XGBoost, LightGBM) often outperforms random forests on tabular data but is more sensitive to hyperparameters and is often slower to train, since the trees have to be built one after another. Knowing when to reach for each is a signal of practical experience.
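
If you get asked what "sequentially correcting errors" means, a stripped-down version is easier to explain than XGBoost. The sketch below is boosting for squared error with one-split trees (stumps) on a single feature. It is a teaching toy under my own assumptions, not how any production library is implemented, and the function names are hypothetical.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold split of x for predicting residual (least squares).
    Assumes x has at least two distinct values."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the max so the right side is non-empty
        left = x <= threshold
        left_val, right_val = residual[left].mean(), residual[~left].mean()
        sse = ((residual[left] - left_val) ** 2).sum() + ((residual[~left] - right_val) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_val, right_val)
    _, threshold, left_val, right_val = best
    return lambda z: np.where(z <= threshold, left_val, right_val)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a shrunken copy."""
    pred = np.full(len(y), y.mean(), dtype=float)
    history = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - pred)  # for squared error, residual = negative gradient
        pred = pred + learning_rate * stump(x)
        history.append(float(np.mean((y - pred) ** 2)))
    return pred, history

x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x)
_, history = boost(x, y)
print(f"train MSE after round 1: {history[0]:.4f}, after round 50: {history[-1]:.4f}")
```

The contrast with a random forest is in the loop: a forest's trees never look at each other's residuals, so they can be trained in any order, while here round k depends on everything before it.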

Handling imbalanced datasets. This comes up in every fraud detection, churn prediction, or medical diagnosis context. Common approaches: resampling (SMOTE for oversampling the minority class, random undersampling of the majority), class weighting in the loss function, and choosing the right metric (accuracy is misleading; use precision and recall or the area under the precision-recall curve; AUC-ROC is common too, but it can look optimistic when positives are very rare). The deeper question is “what does a false negative cost in this domain?”, which gets at the real problem before you choose a method.
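
Here is a small sketch of two of those points: why accuracy misleads, and what class weighting looks like numerically. The weighting rule is the inverse-frequency heuristic that, as far as I know, matches what scikit-learn calls "balanced" weights; the function names and the 1% positive rate are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical fraud data: 10 positives out of 1000 rows
y_true = [1] * 10 + [0] * 990
always_negative = [0] * len(y_true)

print(accuracy(y_true, always_negative))  # high, even though no fraud is caught
print(balanced_class_weights(y_true))     # the rare class gets a much larger weight
```

A model that never predicts fraud scores 99% accuracy here while catching nothing, which is the one-line argument for not reporting accuracy on data like this.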

Model evaluation. Precision is the fraction of positive predictions that were correct. Recall is the fraction of actual positives that you caught. The F1 score balances them. Know when each matters. For spam detection, high precision matters more (you don’t want to filter real email). For cancer screening, high recall matters more (you don’t want to miss cases). Interviewers will sometimes give you a scenario and ask which metric to optimize — that judgment call is the real question.
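
Being able to write these from scratch is a common warm-up, so here is a minimal version for binary labels. The function name is mine, and returning 0.0 when a denominator is zero is a convention I chose for the sketch, not the only reasonable one.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

In this toy example one of the two positive predictions is right (precision 0.5) and one of the three real positives is caught (recall about 0.33). F1 is the harmonic mean, so it sits closer to the weaker of the two numbers.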

3. Python questions that show up more than you’d expect

Python questions in data science interviews are usually not algorithmic coding (that’s more for ML engineering roles). They’re about pandas, NumPy, and occasionally production patterns.

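pandas operations. “Given this DataFrame, compute a rolling 7-day mean grouped by user_id.” That’s a groupby + rolling question. If you haven’t used groupby().rolling() together, it’s worth practicing. The gotchas: a time-based window such as 7D needs sorted datetimes (as the index or via the on= argument), the window restarts at every group so the first rows for each user only see a partial window, and the result comes back indexed by user_id plus the original index, so it has to be realigned before you assign it back to the DataFrame.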
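
Here is one way to write it, as a sketch. The column names date and value are assumptions (the question only names user_id), and sorting by date first is a defensive choice so the datetime index is monotonic.

```python
import pandas as pd

def rolling_7d_mean(df):
    """Per-user mean of `value` over the trailing 7 calendar days."""
    ordered = df.sort_values("date").set_index("date")
    rolled = ordered.groupby("user_id")["value"].rolling("7D").mean()
    # The result is indexed by (user_id, date); flatten it back to columns.
    return rolled.rename("value_7d_mean").reset_index()

df = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b"],
    "date": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-10",
                            "2024-01-01", "2024-01-03"]),
    "value": [1.0, 3.0, 5.0, 10.0, 20.0],
})
print(rolling_7d_mean(df))
```

Two things to point out if you are asked to walk through it: user a's row on 2024-01-10 averages only itself because the earlier rows fall outside the 7-day window, and user b's first row never sees user a's values because the window restarts per group.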

Memory efficiency. “You have a 20GB CSV and 16GB of RAM. How do you process it?” Options: chunked reading with pd.read_csv(chunksize=N), switching to a tool like Dask or Polars that processes lazily, or using a database. Most candidates give one answer. The strong answer reasons through which approach fits which downstream need.
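
The chunked option is the one people most often get asked to sketch. The version below assumes the downstream need is a per-user total, which works because sums can be combined across chunks. The column names user_id and amount and the function name are hypothetical.

```python
import io
import pandas as pd

def total_by_user(csv_source, chunksize=100_000):
    """Sum `amount` per `user_id` without loading the whole file at once."""
    totals = None
    reader = pd.read_csv(csv_source, usecols=["user_id", "amount"], chunksize=chunksize)
    for chunk in reader:
        partial = chunk.groupby("user_id")["amount"].sum()
        # Align on user_id; users missing from one side count as 0.
        totals = partial if totals is None else totals.add(partial, fill_value=0)
    return totals

# Tiny in-memory stand-in for the 20GB file
demo = io.StringIO("user_id,amount\na,1\nb,2\na,3\nb,4\nc,5\n")
print(total_by_user(demo, chunksize=2))
```

This only works cleanly for aggregations that decompose across chunks (sums, counts, min, max). A median or a sort across the whole file does not, and that is the point where Dask, Polars, or a database becomes the better answer.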

Feature engineering. “How would you encode this categorical variable with 400 unique values?” is a practical question. One-hot encoding produces a 400-column sparse matrix. Target encoding (replacing categories with the mean outcome) works but leaks the label if a row’s encoding is computed from data that includes that row; computing the encoding out of fold avoids this. If the model is a neural network, a learned embedding layer is a common choice for high-cardinality categories. Knowing the trade-offs here is the mark of someone who’s shipped models, not just run notebooks.
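
To show what "not leaking" means in practice, here is a minimal out-of-fold target encoder. It is a sketch under my own assumptions: the function and argument names are hypothetical, folds are assigned at random, and unseen categories fall back to the mean of the other folds. Real implementations usually add smoothing as well.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_encode(df, cat_col, target_col, n_folds=5, seed=0):
    """Encode each row using category means computed on the other folds only."""
    rng = np.random.default_rng(seed)
    fold_ids = rng.permutation(len(df)) % n_folds  # roughly equal random folds
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for fold in range(n_folds):
        held_out = fold_ids == fold
        train = df.loc[~held_out]
        means = train.groupby(cat_col)[target_col].mean()
        mapped = df.loc[held_out, cat_col].map(means)
        # Categories unseen in the other folds get their overall mean.
        encoded.loc[held_out] = mapped.fillna(train[target_col].mean())
    return encoded

df = pd.DataFrame({
    "city": ["x", "x", "x", "x", "rare"],
    "churned": [1, 0, 1, 0, 1],
})
print(out_of_fold_target_encode(df, "city", "churned"))
```

The "rare" row is the one to look at. A naive encoder would assign it 1.0, which is just its own label, while the out-of-fold version never lets a row see its own target.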

4. deep learning — know enough to not look blank

Not every DS role requires deep learning knowledge. But if it’s in the job description at all, you’ll get at least one question about it.

Transformer architecture has come up a lot since 2022. At a minimum, know the self-attention mechanism conceptually: each token attends to every other token in the sequence, with weights computed from learned query and key projections determining how much attention to pay. Know why this is different from RNNs (training parallelizes across the sequence, and any two positions are connected directly, so long-range signal doesn’t have to survive many recurrent steps where gradients can vanish). You don’t need to implement multi-head attention from scratch in most DS interviews, but you should be able to explain the intuition.
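
If you want something concrete behind the intuition, single-head scaled dot-product attention fits in a few lines of NumPy. This sketch leaves out batching, masking, and multiple heads, and the projection matrices here are random stand-ins for weights that would normally be learned. The names w_q, w_k, and w_v are my own.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (tokens, dim) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # token-to-token similarity
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens, dim = 4, 8
x = rng.normal(size=(tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
output, weights = self_attention(x, w_q, w_k, w_v)
print(weights.round(2))  # row i: how much token i attends to each token
```

Each row of the weight matrix sums to 1, and every row is computed with the same matrix multiplications at once. That is the "parallel across the sequence" point, in contrast to an RNN stepping through tokens one at a time.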

Transfer learning is the practical concept: taking a model pre-trained on a large dataset and fine-tuning it on your specific task. Know why it works (learned representations transfer across related tasks) and when it doesn’t (when your domain is very far from the pre-training domain).

5. the production and deployment questions

These questions separate candidates who’ve shipped models from candidates who’ve only built models.

“How do you deploy a model?” At minimum: serialize the model (pickle, joblib, ONNX for cross-framework), wrap it in an API (FastAPI is common), containerize it (Docker), and deploy to a serving infrastructure. The follow-up is usually about monitoring: how do you know when your model is degrading in production? Answers should mention data drift (input distribution shifting), concept drift (the relationship between inputs and outputs changing), and what metrics you’d track.
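
For the monitoring follow-up, it helps to name one concrete drift check. The sketch below computes the Population Stability Index (PSI) for a single numeric feature by comparing training-time values against recent production values. The function name, the quantile binning, and the epsilon floor are assumptions of mine, and any alert threshold you put on the result is a convention to tune rather than a rule.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a new sample of one numeric feature."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Interior bin edges come from quantiles of the reference data.
    interior = np.quantile(expected, np.linspace(0, 1, n_bins + 1)[1:-1])

    def bin_fractions(values):
        idx = np.searchsorted(interior, values, side="right")
        counts = np.bincount(idx, minlength=n_bins)
        return np.clip(counts / len(values), eps, None)  # avoid log(0)

    e_frac, a_frac = bin_fractions(expected), bin_fractions(actual)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
training = rng.normal(0.0, 1.0, 5000)
production = rng.normal(1.0, 1.0, 5000)  # simulated shift in the input
print(population_stability_index(training, training))
print(population_stability_index(training, production))
```

This only covers data drift on one input. Concept drift needs labels, so in practice it is tracked through delayed performance metrics once ground truth arrives.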

According to the BLS Occupational Outlook for data scientists, employment in the field is projected to grow 36% from 2023 to 2033, much faster than the average for all occupations. That growth means more competition for the roles at good companies, which means the interview bar is rising.

The Stack Overflow Developer Survey 2024 found Python remains the dominant language for data science by a wide margin, with pandas and scikit-learn still the most-used libraries. Knowing the core Python data science stack deeply is more valuable than knowing six frameworks shallowly.

6. practicing the explanation, not just the answers

There’s a real skill gap I’ve noticed between people who understand the bias-variance tradeoff and people who can explain it to an interviewer clearly under time pressure. Those are different skills.

If you want to practice explaining your reasoning out loud on ML and stats questions, tools like Craqly let you do mock interviews with AI feedback on clarity and completeness. It’s useful for catching the places where you think you’ve explained something but the explanation didn’t actually land.

That’s mostly what I’ve got. The statistics and ML fundamentals are learnable from any decent textbook. The hard part is whether you can explain them clearly when someone is watching the clock.
