✅ You're in! Your free cheat sheet is below. Bookmark this page for reference. Want the full 50-question pack with scoring rubrics? Get it here →

FREE

Your Free 2026 Data Engineering Interview Cheat Sheet

Ten real questions on dbt, Apache Iceberg, and DuckDB, with model answers and interviewer context so you know what to emphasize.

10 modern data stack questions with practical trade-offs. Built for 2026 hiring loops and real data engineering screens.

What you will get

10 questions, with model answers and interviewer intent

  • dbt incremental logic and testing
  • Iceberg architecture and performance
  • DuckDB fit, limits, and cost trade-offs
Get the Full Pack →

Top 10 Modern Data Stack Questions

Each question includes a model answer plus a short explanation of why interviewers ask it. Use these as a cheat sheet for how to structure your responses.

Question 1

Walk me through how you choose an incremental strategy in dbt. What are the trade-offs between append, merge, and delete+insert?

Model Answer

I start by asking what the data looks like and what correctness guarantees I need. For event streams where records never change, append or insert_overwrite with partitions is cheapest because I load new rows and move on. When records can be updated after the fact, such as order status changes, I use merge, which upserts on a unique key. The trade-off is that merge often requires a full scan of the target table, which gets expensive at scale. Delete+insert is my middle ground: I delete rows in a matching partition window and re-insert corrected data, which avoids full table scans while still handling late updates. I default to merge for correctness, then optimize to delete+insert once I understand the partition shape.
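A merge-strategy incremental model might be configured like this (a sketch following standard dbt conventions; the model and column names are invented):

```sql
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='order_id'
) }}

select order_id, status, updated_at
from {{ ref('stg_orders') }}

{% if is_incremental() %}
-- On incremental runs, only pick up rows changed since the last load
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
```

Switching `incremental_strategy` to `delete+insert` keeps the same model body; only the write path against the target table changes.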

Why Interviewers Ask This

They want to know if you understand the cost and correctness trade-offs of each strategy, not just that the configuration exists.

Question 2

What is Apache Iceberg and why has it become a core part of the modern data stack? How does it differ from the Hive table format?

Model Answer

Iceberg is an open table format designed for large analytic tables stored in object storage. The key innovation is tracking data at the file level rather than the partition directory level, which is how Hive works. Hive relies on directory listings to determine what data exists, which creates race conditions when multiple writers touch the same partition. Iceberg solves that with a metadata tree: a catalog pointer references a table metadata file, which tracks snapshots; each snapshot points to a manifest list, whose manifests in turn track individual Parquet data files. That design enables ACID guarantees through optimistic concurrency control, full schema evolution without rewrites, time travel via snapshot history, and hidden partitioning so queries do not need to filter on partition columns explicitly. In practice, it lets me run reliable batch and streaming writes to the same table without locking or Hive metastore bottlenecks.
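The metadata chain above can be sketched as a toy Python structure. This is illustrative only, not the real Iceberg file layout (the actual format stores these as JSON and Avro files in object storage, and the names here are simplified):

```python
# Toy model of Iceberg's metadata tree:
# catalog pointer -> snapshot -> manifest list -> manifests -> data files.
catalog = {"current_snapshot_id": 42}

snapshots = {
    42: {"manifest_list": "ml-42.avro"},
}

manifest_lists = {
    "ml-42.avro": ["manifest-a.avro", "manifest-b.avro"],
}

manifests = {
    "manifest-a.avro": ["data/part-0.parquet", "data/part-1.parquet"],
    "manifest-b.avro": ["data/part-2.parquet"],
}

def files_for_current_snapshot():
    """Walk the tree the way a planner would, never listing directories."""
    snap = snapshots[catalog["current_snapshot_id"]]
    files = []
    for manifest in manifest_lists[snap["manifest_list"]]:
        files.extend(manifests[manifest])
    return files

print(files_for_current_snapshot())
# -> ['data/part-0.parquet', 'data/part-1.parquet', 'data/part-2.parquet']
```

Because every reader starts from one catalog pointer, a commit is just an atomic swap of that pointer, which is where the ACID behavior comes from.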

Why Interviewers Ask This

They are testing whether you understand the metadata layer, not just that Iceberg is a better Hive.

Question 3

When would you choose DuckDB over Spark or a cloud data warehouse like BigQuery?

Model Answer

I choose DuckDB when the data fits in memory or on a single disk and I want SQL analytics without cluster overhead. For local development, ELT prototyping, or ad hoc queries on Parquet files in an S3 bucket, DuckDB is faster to spin up and dramatically cheaper than provisioning a warehouse. It also shines in CI pipelines: I can run dbt models against DuckDB in seconds instead of waiting for a cloud warehouse. Where it falls short is horizontal scale. Once I am joining terabytes across many tables or need concurrent users hitting a shared catalog, Spark or a cloud warehouse is the right call. DuckDB has replaced a large chunk of what I used to use Pandas for, and for single-node analytical work it often beats Spark because there is no serialization overhead between executors.

Why Interviewers Ask This

They want to see if you have a calibrated sense of tool selection rather than just knowing DuckDB exists.

Question 4

What makes a good dbt staging model, and why should staging be kept separate from marts?

Model Answer

A good staging model has one job: clean and standardize raw source data without applying business logic. That means renaming columns to a consistent convention, casting types, coalescing nulls where appropriate, and deduplicating if the source delivers duplicates. I always put a source() reference at the base so lineage is tracked in the DAG. Staging stays separate from marts because it enforces separation of concerns. Marts are where business logic lives, including joins, aggregations, and metric definitions. If I bake raw column names or casts directly into a mart, every downstream model inherits that fragility. When a source changes a column name, I fix it once in the staging model and every mart downstream is automatically correct. It also makes onboarding easier because anyone reading a mart can assume clean, typed, deduplicated inputs.
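A staging model in that style might look like this (a sketch; the source and column names are invented):

```sql
-- models/staging/stg_orders.sql
with source as (

    select * from {{ source('shop', 'orders') }}

),

renamed as (

    select
        id                                 as order_id,
        cast(created_at as timestamp)      as ordered_at,
        lower(coalesce(status, 'unknown')) as order_status
    from source

)

select * from renamed
```

No joins, no aggregations, no business logic: just one `source()` at the base so lineage is tracked, and consistent names and types for everything downstream.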

Why Interviewers Ask This

They are testing architectural thinking and whether you build for maintainability, not just correctness.

Question 5

How would you explain the data lakehouse to someone who knows both data lakes and data warehouses? What problem does it solve?

Model Answer

The classic data lake gave you cheap object storage and schema-on-read flexibility, but you paid for it with no ACID guarantees, poor query performance, and difficult governance. Data warehouses gave you fast SQL and transactions, but at high cost and with proprietary formats that locked you in. The lakehouse tries to get both: you store data in open formats like Parquet or ORC on cheap object storage, then layer a table format like Iceberg or Delta Lake on top to add ACID transactions, schema enforcement, and time travel. A query engine like Trino, Spark, or DuckDB reads that open format directly. In practice, I can run streaming writes and batch reads on the same table, support BI tools and ML training from the same storage layer, and avoid the ETL copy step between a lake and a warehouse. The trade-off is operational complexity because you own the table format, catalog, and query engine separately.

Why Interviewers Ask This

They want to see that you understand the architectural motivation, not just the marketing definition.

Question 6

Explain column pruning and partition pruning. Why do they matter for query performance at scale?

Model Answer

Partition pruning means the query engine eliminates entire files or directories from a scan based on the filter predicates in the query. If my Iceberg table is partitioned by event_date and I filter with WHERE event_date = '2026-01-15', the engine reads the manifest list, identifies which data files belong to that partition, and skips everything else. Column pruning is the complement. Because Parquet is columnar, the engine reads only the columns referenced in the SELECT and WHERE clauses, skipping the rest of the file entirely. Together, these mechanisms turn a multi-terabyte scan into a targeted read of a small fraction of the data. This goes wrong when engineers use non-selective filters or store wide tables with many unused columns and then wonder why queries are slow. Getting partition design right upfront is one of the highest leverage decisions in lakehouse work.
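The mechanism can be illustrated with a toy pruning pass over per-file min/max statistics, the kind of metadata a table format keeps in its manifests. The file paths and stats here are invented:

```python
# Each entry mimics the column stats a table format records per data file.
files = [
    {"path": "dt=2026-01-14/a.parquet", "min": "2026-01-14", "max": "2026-01-14"},
    {"path": "dt=2026-01-15/b.parquet", "min": "2026-01-15", "max": "2026-01-15"},
    {"path": "dt=2026-01-16/c.parquet", "min": "2026-01-16", "max": "2026-01-16"},
]

def prune(files, predicate_value):
    """Keep only files whose [min, max] range could satisfy the filter."""
    return [f["path"] for f in files if f["min"] <= predicate_value <= f["max"]]

print(prune(files, "2026-01-15"))  # ['dt=2026-01-15/b.parquet']
```

Two of the three files are skipped before any bytes are read, which is the whole point: pruning happens against metadata, not data.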

Why Interviewers Ask This

They are checking whether you understand how query engines interact with the storage layer, not just how to write SQL.

Question 7

What tests do you write in dbt and how do you catch schema changes before they cause downstream failures?

Model Answer

My baseline for every model is not_null and unique on primary keys, plus relationships tests on foreign key joins. For categorical columns I add accepted_values to catch silent schema changes where a source adds a new status code that breaks downstream CASE statements. Beyond built-in tests, I use dbt-expectations for column level statistical checks, like row count within an expected range or no unexpected nulls in columns that should always be populated. To catch schema changes proactively, I lean on dbt source freshness combined with model contracts, available in dbt 1.5 and later. Contracts enforce column names and types at build time, so if an upstream team renames a column the job fails loudly at the staging layer rather than silently propagating nulls into a mart. I also run dbt test in CI on every PR, not just on deploys, so issues surface before they hit production.
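That baseline looks something like this in a schema file (a sketch; model and column names are invented, and newer dbt versions also accept `data_tests:` in place of `tests:`):

```yaml
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'delivered', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

If the source starts emitting a new status code, the accepted_values test fails on the next run instead of a CASE statement silently dropping rows downstream.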

Why Interviewers Ask This

They want to know if you treat data quality as a first-class engineering concern, not an afterthought.

Question 8

What is the semantic layer and why does it matter? How do tools like dbt Metrics and Cube.dev fit in?

Model Answer

The semantic layer sits between transformed data and the people or tools consuming it. It defines business metrics, dimensions, and joins in one place so that revenue means the same thing whether an analyst queries it in Tableau, an engineer hits the API, or a stakeholder runs it in a notebook. Without it, you get metric sprawl, where different dashboards compute the same metric in slightly different ways. dbt Metrics, now the MetricFlow layer in dbt, lets you define metrics in YAML alongside your models so the definition travels with lineage. Cube.dev takes a similar idea further by adding a caching and API layer so BI tools can query the semantic layer directly rather than raw SQL. The practical value is consistency and trust. When finance and product see the same number from the same definition, you stop spending half of every meeting arguing about whose query is right.
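In MetricFlow, a metric definition travels with the project as YAML. A rough sketch of the idea (check the current dbt Semantic Layer docs for exact syntax; the measure name is invented):

```yaml
metrics:
  - name: revenue
    label: Revenue
    description: "Gross order revenue, net of refunds"
    type: simple
    type_params:
      measure: order_total
```

Every consumer that asks for `revenue` gets this one definition, which is what kills metric sprawl.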

Why Interviewers Ask This

They are testing whether you think about data as a product consumed by many stakeholders, not just a set of tables for analysts to query.

Question 9

How do you control costs on a cloud data platform like Snowflake, BigQuery, or Databricks? Walk me through your actual approach.

Model Answer

I approach cost management at three levels. First, query design: I push partition and column pruning as hard as possible, avoid SELECT * on wide tables, and cluster or partition tables on the columns most often used in filters and joins. Second, compute configuration: on Snowflake I right-size warehouses and turn on auto-suspend aggressively so idle warehouses do not burn credits. On BigQuery I prefer slot reservations for predictable workloads over on-demand pricing once volume is stable. On Databricks I use spot instances for batch jobs and reserve on-demand capacity only for jobs with SLA requirements. Third, observability: I set up cost dashboards by team and query tag, alert on anomalous spend, and do a monthly query audit to find the ten most expensive queries. The biggest lever I have found is killing repeated full table scans caused by missing filters, which often cuts spend by 30 to 50 percent with one query fix.
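On Snowflake, the monthly audit step can start from the ACCOUNT_USAGE share. An illustrative query using elapsed time as a first proxy for cost (verify column names against the current Snowflake docs):

```sql
-- Ten slowest queries in the last 30 days
select
    query_text,
    warehouse_name,
    total_elapsed_time / 1000 as elapsed_seconds
from snowflake.account_usage.query_history
where start_time >= dateadd('day', -30, current_timestamp())
order by total_elapsed_time desc
limit 10;
```

The repeat offenders at the top of this list are almost always the missing-filter full table scans mentioned above.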

Why Interviewers Ask This

They want to know if you treat infrastructure cost as an engineering responsibility and can speak to concrete levers, not just say "use smaller warehouses."

Question 10

What does an AI-augmented data pipeline look like in practice? How are teams actually using LLMs in data engineering workflows today?

Model Answer

I have seen a few real patterns. The most mature is LLM-assisted text-to-SQL, where a semantic layer description is passed to a model and it generates a query from a natural language question. The key engineering challenge is structured output. You cannot let the model return free text and hope it is valid SQL. I use function calling or a library like Instructor to force the model to return a typed schema. A second pattern is automated data classification: piping column samples through an LLM to detect PII, suggest data types, or generate column descriptions for a data catalog, which reduces manual documentation time. A third is pipeline monitoring: feeding anomaly signals and recent query logs to a model to generate a root cause hypothesis. The honest caveat is that LLMs can hallucinate, so any AI step in a pipeline needs a validation gate. I treat LLM output as a draft that gets checked, not a final answer that gets committed.
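The validation-gate idea can be sketched without any model call: treat the LLM's reply as untrusted text and parse it against a schema before anything downstream runs. The field names and schema here are hypothetical:

```python
import json

# Hypothetical schema for a text-to-SQL plan: the model must return
# a JSON object with these fields, or the pipeline rejects the output.
REQUIRED_FIELDS = {"table": str, "columns": list, "filters": list}

def validate_plan(raw_reply: str) -> dict:
    """Gate LLM output: reject free text, accept only schema-conforming JSON."""
    try:
        plan = json.loads(raw_reply)
    except json.JSONDecodeError:
        raise ValueError("model did not return valid JSON")
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(plan.get(field), expected_type):
            raise ValueError(f"missing or mistyped field: {field}")
    return plan

good = '{"table": "orders", "columns": ["id"], "filters": []}'
print(validate_plan(good)["table"])  # orders
```

In production you would typically reach for a schema library (Pydantic, or Instructor on top of it) rather than hand-rolled checks, but the principle is the same: invalid output fails loudly before it touches the warehouse.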

Why Interviewers Ask This

They want to know if you have thought critically about where AI adds real value in data workflows versus where it introduces risk, and whether you have hands-on experience beyond reading blog posts.

Want 50 more questions? Get the full Interview Drill Pack for $39

The full pack includes 50 additional questions, timed drills, scoring rubrics, and a 7-day prep plan built to mirror real data engineering interviews.

Instant access after checkout on Gumroad.

Use this cheat sheet as your warmup, then go deeper with the full pack.

Get the Full Pack →