There’s a question that comes up in nearly every senior data engineering interview at companies like Databricks, Stripe, and the larger fintechs, and most candidates answer it wrong the first time they hear it. The question is: “Walk me through how you’d design a pipeline for late-arriving data.” It sounds operational. It’s actually a systems thinking question, and the wrong answer is any answer that doesn’t immediately ask “how late, and what’s the downstream consumer’s tolerance for stale data?”
That gap, between what looks like a technical question and what’s actually being tested, is what this post is about.
Data modeling and SQL
These come first in most interviews, often as a warm-up, but they’re not trivial.
The canonical questions: explain the difference between a star and snowflake schema. Explain what a slowly changing dimension is and how you’d implement SCD Type 2. Write a query to find duplicate records on a composite key. Write a window function query to calculate a running total per customer.
What’s actually being tested with the SCD Type 2 question isn’t whether you know the concept. It’s whether you’ve thought about what happens at the edges — what do you do when a record arrives that should have been backdated? What’s your strategy for current-row flagging vs. valid_from/valid_to date ranges? Have you seen this bite a pipeline in production?
The SQL questions at senior levels often involve refactoring. You’ll be shown a query with six levels of nested subqueries and asked to make it maintainable. CTEs are usually the right move, but the interesting part of the answer is knowing when a temp table or materialized view would perform better on a large scan.
ETL/ELT and pipeline design
These questions are where you separate candidates who’ve read the docs from candidates who’ve been paged at 2am because a pipeline silently failed.
Common questions: What does idempotency mean for a data pipeline, and how do you enforce it? How do you handle schema evolution without breaking downstream consumers? What’s your strategy for backfilling three months of data for a new column? How do you test a transformation pipeline?
On idempotency: the expected answer mentions MERGE/UPSERT patterns, partition-based overwrites, and avoiding blind appends. But the strong answer also mentions that idempotency is a contract, not just a property — you have to make sure your orchestration layer won’t kick off duplicate runs in unexpected ways, and that any external API calls in the pipeline handle deduplication too.
Schema evolution is genuinely hard. Good answers mention backwards-compatible changes (adding nullable columns), forwards-compatible changes (adding fields consumers can ignore), and the places where neither is safe (column renames, type changes). Tools like Apache Avro with a schema registry exist partly to solve this, and knowing that signals real experience.
Pipeline testing is an area where I think a lot of senior engineers give surprisingly weak interview answers. Unit tests on transformations are obvious. Data quality assertions — things dbt tests or Great Expectations can do — are the real signal. Even better: being able to describe a silent failure you’ve caught with a quality assertion that would have gone unnoticed for days otherwise.
Big data, Spark, and distributed systems
These questions show up more at companies operating at scale. Startups doing under a terabyte a day often don’t go here.
Expect questions like: What happens when you call .collect() on a large DataFrame? Explain data skew and how you’d diagnose it. What’s the difference between a narrow and wide transformation? When would you use a broadcast join?
The .collect() question tests whether you understand the difference between Spark’s lazy evaluation model (transformations build a DAG, nothing executes) and actions (which trigger execution and, in the case of .collect(), pull everything back to the driver). Pulling 500GB to a driver node will crash your job and probably wake someone up.
Data skew is a practical problem that interviewers use to gauge whether you’ve operated pipelines at scale. The symptom is one Spark task taking 10 to 50 times longer than the others. The causes are usually uneven key distribution (one customer_id has 30% of the records) or a join between a large table and a medium table where many keys map to one side. Solutions: salting, repartitioning, broadcast joins for the small side of a skewed join.
Modern stack questions
The “modern data stack” isn’t one thing, but there are a few tools that show up in enough job descriptions that you should know them.
dbt is now essentially standard for transformation work in warehouse-centric shops. Know the four materialization types (view, table, incremental, ephemeral) and when you’d pick each. Know what a ref() does and why it matters for DAG lineage. Know how incremental models handle late-arriving data and what the is_incremental() macro does.
Data lakehouse architecture combines warehouse-style ACID compliance with data lake storage economics. Databricks Delta Lake and Apache Iceberg are the two main formats. Know what ACID means in this context (specifically: what does atomicity mean for a file-based table that might have 50,000 Parquet files?), and know what time travel is and when you’d actually use it.
The Stack Overflow Developer Survey 2024 shows that dbt, Spark, and cloud warehouse tools (Snowflake, BigQuery, Redshift) are used by a significant plurality of data engineers. Knowing at least one warehouse deeply and having opinions about the others reads better than shallow knowledge of everything.
Data quality and observability
This area gets asked about more than it gets prepared for, in my experience. “How do you monitor a data pipeline?” is a question that surfaces in most onsites.
The weak answer: “I set up alerting on job failures.” That’s necessary but not sufficient. The strong answer includes: schema drift detection, row count anomaly detection (did the pipeline produce 40% fewer rows than the rolling 7-day average?), freshness SLAs (if the table hasn’t updated in N hours, alert), and field-level quality assertions on things like null rates and value distributions.
Tools like Monte Carlo, Bigeye, and the built-in testing in dbt exist precisely because silent data quality failures are more expensive than pipeline failures. A pipeline that crashes is obvious. A pipeline that produces wrong results quietly can corrupt weeks of downstream analysis before someone notices.
According to the BLS Occupational Outlook, database and data engineering roles are projected to grow 9% through 2032. That’s fast but probably understates the actual demand, given how many analytics engineering roles are being created that the BLS classification doesn’t neatly capture.
What the best candidates do differently
Many interviewers, including people I’ve talked to at mid-size tech companies and larger data-heavy startups, say the same thing: the candidates who move forward are the ones who reason about trade-offs out loud rather than jumping to one right answer.
“Here’s the pattern I’d use and here’s what I’d be giving up” is almost always a stronger signal than “the correct answer is X.” Data engineering involves constant trade-off decisions under constraint. Demonstrating that reasoning process matters as much as getting the technical fact right.
If you want to practice that out-loud reasoning, especially on live technical questions where you can get feedback on whether your explanation is clear, Craqly’s AI interview practice is built for exactly that workflow. It’s useful for drilling the communication layer, which is the part most solo prep skips.
What are companies actually testing for with data skew questions? Nine times out of ten it’s not whether you know the vocabulary. It’s whether you’ve been in a situation where things went wrong at scale, and whether you learned from it.