Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?
Snowflake, Databricks, or S3? We explain the evolution of data storage. Learn when to use a struc...
TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg) that bring SQL performance to raw storage - the best of both.
Three Ways to Store Data at Scale: What Problem Each Solves
As organizations grow, they end up with data coming from many sources: transactional databases, clickstreams, IoT sensors, third-party APIs. Storing and querying all of it efficiently requires choosing the right abstraction.
| Question | Warehouse | Lake | Lakehouse |
|---|---|---|---|
| What data format? | Structured tables, SQL | Raw files (JSON, Parquet, CSV) | Open formats (Parquet + metadata layer) |
| Who uses it? | Analysts, BI tools | Data scientists, ML engineers | Both |
| Primary operation | SQL aggregations | Batch ML training, exploration | SQL + ML on same data |
| When to update data? | ETL: clean before storing | ELT: store raw, transform later | ACID on open files |
| Cost model | Compute + managed storage | Cheap object storage | Object storage + optional compute |
OLAP, OLTP, and the Anatomy of a Data Decision
Before picking a storage architecture, it helps to understand the two fundamentally different ways organizations query data.
OLTP (Online Transaction Processing) is what your application's production database does: short, fast reads and writes such as "insert this order" or "update this balance". PostgreSQL and MySQL are OLTP systems, optimized for low-latency single-row operations.
OLAP (Online Analytical Processing) is what happens when Finance asks "what was our revenue by product category for every quarter in the last three years?" - a query scanning billions of rows. Warehouses, lakes, and lakehouses are all OLAP-oriented.
| Dimension | OLTP (Production DB) | OLAP (Warehouse / Lake) |
|---|---|---|
| Query type | Single-row reads/writes | Multi-row aggregations |
| Latency target | < 10 ms | Seconds to minutes |
| Data format | Normalized rows | Columnar, denormalized |
| Scale | GB to TB | TB to PB |
| Example | PostgreSQL, MySQL | Snowflake, S3, Delta Lake |
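The contrast is easiest to feel in a few lines of code. A minimal sketch using Python's built-in sqlite3 as a stand-in for both workload styles (a real OLTP system would be PostgreSQL or MySQL and a real OLAP engine Snowflake or BigQuery; the table and values here are invented):

```python
import sqlite3

# In-memory SQLite as a stand-in for both workload styles (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, category TEXT, amount REAL)")

# OLTP: short single-row writes, the kind a production app issues constantly.
conn.execute("INSERT INTO orders (category, amount) VALUES ('books', 12.50)")
conn.execute("INSERT INTO orders (category, amount) VALUES ('books', 30.00)")
conn.execute("INSERT INTO orders (category, amount) VALUES ('games', 59.99)")
conn.execute("UPDATE orders SET amount = 14.00 WHERE id = 1")  # "update this balance"

# OLAP: one query that scans and aggregates many rows at once.
rows = conn.execute(
    "SELECT category, SUM(amount) FROM orders GROUP BY category ORDER BY category"
).fetchall()
print(rows)  # [('books', 44.0), ('games', 59.99)]
```

The OLTP statements each touch one row; the OLAP query touches every row. At TB-to-PB scale that difference is exactly why the two workloads get different storage engines.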
A second key distinction is data shape:
- Structured: rows and columns with a fixed schema (orders, events, user profiles).
- Semi-structured: self-describing but variable layout (JSON, Parquet, Avro logs).
- Unstructured: no schema at all (images, PDFs, audio, raw text).
Warehouses only accept structured data. Lakes and lakehouses handle all three. If your organization generates diverse data (ML training sets, sensor readings, chat logs), a warehouse alone is a dead end.
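To make that concrete, here is a small sketch (hypothetical schema and records, not any real pipeline) of why a warehouse rejects what a lake happily stores:

```python
import json

# Hypothetical warehouse-style schema: fixed, typed columns.
ORDER_SCHEMA = {"order_id": int, "category": str, "amount": float}

def warehouse_load(record: dict) -> dict:
    """Enforce the schema at write time, as a warehouse ETL step would."""
    if set(record) != set(ORDER_SCHEMA):
        raise ValueError(f"schema mismatch: {sorted(record)}")
    return {k: ORDER_SCHEMA[k](v) for k, v in record.items()}

def lake_load(record: dict) -> str:
    """A lake just lands the raw bytes - any shape is accepted."""
    return json.dumps(record)

clean = {"order_id": 1, "category": "books", "amount": "12.50"}
messy = {"order_id": 2, "category": "games", "amount": 59.99,
         "clickstream": [{"page": "/home", "dwell_ms": 412}]}  # extra nested field

print(warehouse_load(clean))   # coerced to typed columns
print(lake_load(messy))        # stored as-is, nested field and all
try:
    warehouse_load(messy)      # rejected: unexpected column
except ValueError as e:
    print("warehouse rejected:", e)
```

The nested clickstream field is exactly the kind of semi-structured payload a warehouse schema cannot represent without flattening it first.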
Deep Dive: The Data Warehouse - SQL on Clean, Structured Data
A warehouse ingests pre-processed, schema-validated data. The ETL pipeline (Extract, Transform, Load) cleans and normalizes data before it enters.
```mermaid
flowchart LR
    Source[Operational DBs\nAPIs / Events] --> ETL[ETL Pipeline\nTransform + Validate]
    ETL --> DW[Data Warehouse\nSnowflake / BigQuery\nRedshift]
    DW --> BI[BI Tools\nTableau / Looker]
    DW --> Analyst[SQL Analysts]
```
Strengths:
- Sub-second query performance on aggregations over billions of rows.
- Strong governance: schema enforcement, row-level security, audit trails.
- Familiar SQL interface for business users.
Limitations:
- Only stores structured, pre-defined schemas.
- Poor fit for unstructured data (images, logs, text, audio).
- Re-processing raw data means going back to the source - the warehouse discards the raw form.
The Data Lake: Raw Files for Every Use Case
A lake stores everything in its raw form - no transformation required at write time. Data scientists can explore it, ML engineers can train on it, and analysts can query subsets of it.
```mermaid
flowchart LR
    Source2[Any source] --> Lake[Data Lake\nS3 / GCS / HDFS\nParquet, JSON, CSV, Images]
    Lake --> Spark[Spark Jobs\nML Training]
    Lake --> Notebook[Notebooks\nExploration]
    Lake --> DW2[Copy subsets\nto Warehouse]
```
Strengths:
- Stores anything: structured, semi-structured, unstructured.
- Cheap object storage: pennies per GB.
- Supports large-scale ML training with frameworks like Spark, Ray, PyTorch.
Limitations:
- "Data swamp" risk: without governance, data becomes untrustworthy and unmaintainable.
- No ACID transactions: multiple writers can corrupt files.
- Poor for interactive SQL queries (full file scans, no indexes).
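The "no indexes, full scans" limitation is easy to demonstrate. A toy sketch, using a local directory of JSON files as a stand-in for an object store like S3 (file names and events are invented):

```python
import json
import tempfile
from pathlib import Path

# A toy "lake": a directory of raw JSON event files (stand-in for s3://...).
lake = Path(tempfile.mkdtemp())
events = [
    {"user": "a", "action": "play"},
    {"user": "b", "action": "skip"},
    {"user": "a", "action": "play"},
]
for i, ev in enumerate(events):
    (lake / f"events-{i}.json").write_text(json.dumps(ev))

# With no indexes or statistics, *every* query is a full scan of every file -
# which is why interactive SQL on a plain lake is slow.
def count_action(action: str) -> int:
    n = 0
    for f in sorted(lake.glob("*.json")):   # full scan, file by file
        if json.loads(f.read_text())["action"] == action:
            n += 1
    return n

print(count_action("play"))  # 2
```

With three files this is instant; with millions of objects behind an S3 API, the same pattern means minutes of scanning for a question a warehouse answers from column statistics in milliseconds.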
Deep Dive: The Data Lakehouse - Open Table Formats Bridge the Gap
The lakehouse uses open table formats (Delta Lake, Apache Iceberg, Apache Hudi) on top of cheap object storage to add the features a warehouse provides:
- ACID transactions on Parquet files using a transaction log.
- Schema enforcement and evolution.
- Time travel: query data as it existed at any past timestamp.
- Query optimization: data skipping, Z-ordering, compaction.
```mermaid
flowchart TB
    Raw[S3 / GCS\nParquet files] --> TableFmt[Delta Lake / Iceberg\nTransaction Log + Stats]
    TableFmt --> SparkSQL[Spark SQL\nML Pipelines]
    TableFmt --> Presto[Presto / Trino\nInteractive SQL]
    TableFmt --> BITool[BI Tools\nnative connector]
```
Used by Databricks (Delta Lake), Netflix (Iceberg), Uber (Hudi). You keep cheap storage but gain warehouse-quality query performance and data consistency.
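The transaction-log idea behind these formats can be sketched in a few dozen lines. The following is a deliberately simplified toy, not the real Delta or Iceberg format: immutable data files plus a log that records which files belong to each table version, which is what makes both ACID-style commits and time travel possible:

```python
import json
import tempfile
from pathlib import Path

class ToyTable:
    """Toy lakehouse table: immutable data files + a versioned commit log."""

    def __init__(self, root: Path):
        self.root = root
        self.log = root / "_log"
        self.log.mkdir(parents=True, exist_ok=True)

    def _versions(self):
        return sorted(int(p.stem) for p in self.log.glob("*.json"))

    def commit(self, rows):
        versions = self._versions()
        version = versions[-1] + 1 if versions else 0
        data_file = self.root / f"part-{version}.json"
        data_file.write_text(json.dumps(rows))          # data files never mutate
        prev = self.read_files(version - 1) if version else []
        # Commit point: the version exists only once its log entry is written,
        # so readers never observe a half-finished write.
        (self.log / f"{version}.json").write_text(json.dumps(prev + [data_file.name]))

    def read_files(self, version):
        return json.loads((self.log / f"{version}.json").read_text())

    def read(self, version_as_of=None):
        version = version_as_of if version_as_of is not None else self._versions()[-1]
        rows = []
        for name in self.read_files(version):           # log, not listing, decides
            rows += json.loads((self.root / name).read_text())
        return rows

table = ToyTable(Path(tempfile.mkdtemp()))
table.commit([{"id": 1}])                 # version 0
table.commit([{"id": 2}])                 # version 1
print(len(table.read()))                  # 2 rows at the latest version
print(len(table.read(version_as_of=0)))   # time travel: 1 row at version 0
```

Real formats add schema metadata, column statistics for data skipping, and concurrency control on the log write, but the shape is the same: readers trust the log, not the file listing.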
ETL vs ELT vs Lakehouse Pipelines: Seeing the Difference
The processing philosophy is as important as the storage layer. Here is how data flows in each paradigm, from the same raw sources all the way to the consumer:
```mermaid
flowchart TD
    SRC[Data Sources\nAPIs · Databases · Event Logs]
    SRC --> ETL_T[Transform & Clean\nETL Pipeline]
    ETL_T --> WH[Warehouse\nSnowflake · BigQuery]
    WH --> BI_T[Dashboards & Reports\nBI Analysts]
    SRC --> ELT_L[Land Raw Files\nS3 · GCS · HDFS]
    ELT_L --> ELT_T[Transform in Place\nSpark · dbt]
    ELT_T --> ML[ML Training\nData Science Teams]
    SRC --> LH_L[Land as Parquet\nS3 + Delta Lake / Iceberg]
    LH_L --> LH_T[ACID Table Layer\nTransaction Log + Stats]
    LH_T --> SQL_Q[SQL Queries\nBI Analysts]
    LH_T --> ML2[ML Pipelines\nData Science Teams]
```
The key contrast in practice:
- ETL cleans before writing: fast to query, but the raw data is discarded.
- ELT writes raw first: flexible, but requires transformation discipline downstream.
- Lakehouse writes raw Parquet with a governance layer: you get fast queries and raw access.
Think of ETL as a hotel that only admits guests in formal attire: clean and easy to serve, but many guests can't get in. A data lake is a campground that takes everyone with no rules. A lakehouse is a hotel that accepts everyone and maintains order internally.
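The ETL/ELT contrast in miniature (hypothetical events and transformation, illustrative only):

```python
# The same raw events feed both pipeline philosophies.
raw_events = [
    {"user": "a", "amount": "12.50", "debug": "trace-xyz"},
    {"user": "b", "amount": "30.00", "debug": "trace-abc"},
]

def transform(event: dict) -> dict:
    # Business transformation: type the amount, drop operational noise.
    return {"user": event["user"], "amount": float(event["amount"])}

# ETL: transform first, store only the clean result - the raw form is gone.
warehouse = [transform(e) for e in raw_events]

# ELT: land raw first, transform later - and re-transform differently anytime.
lake = list(raw_events)                    # stored untouched
curated = [transform(e) for e in lake]     # today's transformation
debug_view = [e["debug"] for e in lake]    # still possible, because raw was kept

print(warehouse[0])   # {'user': 'a', 'amount': 12.5}
print(debug_view)     # ['trace-xyz', 'trace-abc']
```

If next quarter someone needs the debug traces, the ETL path has nothing left to give; the ELT path just writes a new transformation.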
Architecture Data Flow Comparison
```mermaid
flowchart LR
    subgraph WH[Warehouse: ETL Path]
        S1[Raw Source] --> T1[Transform & Clean]
        T1 --> DW1[(Snowflake\nBigQuery)]
        DW1 --> BI1[BI Dashboards]
    end
    subgraph LK[Lake: ELT Path]
        S2[Raw Source] --> L2[(S3 / HDFS\nRaw Parquet)]
        L2 --> T2[Transform in Spark]
        T2 --> ML2[ML Training]
    end
    subgraph LH[Lakehouse: Unified Path]
        S3[Raw Source] --> P3[(S3 + Delta Lake\nIceberg Table)]
        P3 --> Q3[SQL via Trino]
        P3 --> ML3[ML via Spark]
    end
```
ETL Pipeline vs ELT Pipeline
```mermaid
sequenceDiagram
    participant Src as Data Source
    participant ETL as ETL Engine
    participant WH as Warehouse
    participant Lake as Data Lake
    participant ELT as Spark/dbt
    Note over Src,WH: ETL - Transform before load
    Src->>ETL: Raw extract (orders, events)
    ETL->>ETL: Validate + clean + schema
    ETL->>WH: Load clean structured data
    WH-->>ETL: Indexed, ready for SQL
    Note over Src,ELT: ELT - Load raw, transform later
    Src->>Lake: Land raw files (JSON/Parquet)
    Lake-->>ELT: Raw data available
    ELT->>Lake: Transform in place (dbt/Spark)
    Lake-->>ELT: Query-ready tables
```
Real-World Application: Netflix, Airbnb, and Spotify
Airbnb and the SQL-First Warehouse
Airbnb runs Snowflake to power thousands of internal BI reports. Analysts write SQL to query pre-aggregated booking, pricing, and host data. Because the data is clean and typed, a complex multi-table aggregation across millions of listings completes in seconds. Business users never wait for Spark jobs.
Best fit: high-frequency SQL dashboards, regulatory reporting, executive metrics.
Spotify and the Petabyte Data Lake
Spotify stores petabytes of raw listening events in S3 as Parquet files. Machine learning teams pull that raw data directly into training pipelines with Spark or PyTorch. The raw signal matters: a cleaned-up summary would lose the nuance that separates "skip after 5 seconds" from "replayed three times immediately."
Best fit: ML training at scale, raw event archival, research workloads requiring full signal fidelity.
Netflix and the Lakehouse Migration
Netflix migrated from a plain data lake to Apache Iceberg on S3 to solve a costly problem: hundreds of tables written to simultaneously by multiple teams were occasionally corrupting each other. Iceberg's ACID transaction log eliminated the corruption and unlocked time travel: Netflix can now audit any table as it existed at any past timestamp.
Uber adopted Apache Hudi for similar reasons: incremental upserts (updating individual ride records) are impossible on a plain lake but trivial with Hudi's merge-on-read format.
| Company | Architecture | Primary Driver |
|---|---|---|
| Airbnb | Warehouse (Snowflake) | BI speed, analyst self-service |
| Spotify | Data Lake (S3 + Parquet) | ML at petabyte scale |
| Netflix | Lakehouse (Iceberg on S3) | ACID correctness + time travel |
| Uber | Lakehouse (Hudi on S3) | Incremental upserts on raw data |
Trade-offs, Failure Modes & Decision Guide: Which One to Pick
| If your team primarily needs… | Choose |
|---|---|
| Fast SQL dashboards for business users | Warehouse (Snowflake, BigQuery) |
| ML model training on raw, diverse data | Lake (S3 + Parquet) |
| Both SQL and ML on the same data | Lakehouse (Delta Lake, Iceberg) |
| Quick start with minimal ops overhead | Warehouse (fully managed) |
| Maximum flexibility and open standards | Lakehouse |
| Smallest storage cost, OK with management complexity | Lake |
In practice, many large organizations run all three: a warehouse for reporting, a lake for raw archival and ML, and a lakehouse for the intersection.
Choosing Your Architecture: A Three-Question Decision Walk
You do not need to memorize the theory β answering three questions about your actual use case gets you to the right answer every time.
Question 1 - Who consumes the data?
- Business analysts running SQL dashboards → lean toward Warehouse.
- Data scientists training ML models → lean toward Lake.
- Both groups sharing the same dataset → lean toward Lakehouse.
Question 2 - What shape is your data?
- All structured (fixed schema, rows and columns) → Warehouse works perfectly.
- Mix of structured, semi-structured, and unstructured → Lake or Lakehouse required.
Question 3 - Do you need ACID correctness?
- Multiple writers, strict consistency, or incremental updates required → Warehouse or Lakehouse.
- Append-only batch jobs, OK with eventual consistency → Lake is sufficient.
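The three questions can be encoded directly. A sketch with invented labels, not an official taxonomy - it simply mirrors the decision walk above:

```python
def choose_architecture(consumers: set, data_shapes: set, needs_acid: bool) -> str:
    """Encode the three-question decision walk (labels are illustrative)."""
    mixed_shapes = data_shapes != {"structured"}
    both_consumers = {"analysts", "data_scientists"} <= consumers
    if both_consumers or (mixed_shapes and needs_acid):
        return "Lakehouse"                     # shared data, or raw data + ACID
    if consumers == {"analysts"} and not mixed_shapes:
        return "Warehouse"                     # SQL dashboards on structured data
    if needs_acid:
        return "Warehouse or Lakehouse"        # correctness trumps flexibility
    return "Lake"                              # raw, append-only ML workloads

print(choose_architecture({"analysts", "data_scientists"},
                          {"structured", "semi-structured"}, True))       # Lakehouse
print(choose_architecture({"analysts"}, {"structured"}, False))           # Warehouse
print(choose_architecture({"data_scientists"}, {"semi-structured"}, False))  # Lake
```

The first call encodes the e-commerce startup scenario from the worked example: both consumer groups, mixed data shapes, ACID required.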
Worked example - an e-commerce startup:
The startup has three data consumers:
- A finance dashboard querying daily revenue (SQL, structured data).
- A recommendation model training on raw clickstreams (ML, semi-structured data).
- A fraud detection system applying incremental risk-score updates (ACID required).
Verdict: A Lakehouse (Delta Lake or Iceberg on S3) covers all three. The same Parquet files power the BI dashboard via Trino SQL, the ML training pipeline via Spark, and the fraud model's incremental ACID writes. No data duplication, a single source of truth, one storage bill.
Five Lessons Learned From Real Data Architecture Failures
1. A data lake without a catalog becomes a data swamp. Without a metadata catalog (Apache Atlas, AWS Glue, or DataHub), teams cannot discover what data exists, how fresh it is, or who owns it. Within months, the lake fills with orphaned datasets nobody trusts. Governance is not optional: build it before the lake grows.
2. Over-engineering to a lakehouse before you need it adds cost. If your team is five analysts writing SQL with zero ML use cases, a managed warehouse (BigQuery, Snowflake) ships faster and costs less to operate. Adopt a lakehouse when your teams genuinely diverge in their data needs, not before.
3. ETL pipelines become bottlenecks as schemas evolve. Warehouse ETL pipelines hardcode transformations. Every time a source system adds a column, someone updates a pipeline. An ELT or lake-first approach stores raw data untouched: schema changes do not break ingestion, and transformations happen on demand downstream.
4. Coupling compute and storage is expensive at scale. Legacy on-premises warehouses couple compute and storage: scaling one means scaling both. Modern cloud designs (Snowflake virtual warehouses, Databricks clusters over S3) decouple them: pay only for active compute, store data cheaply at rest.
5. Time travel is more useful than it sounds. Delta Lake and Iceberg let you query data as it existed 30 days ago. This is not just an audit feature; it is how engineers debug ML model regressions ("why did accuracy drop last Tuesday?") and recover from accidental bulk deletes without triggering a full restore.
TLDR: Summary & Key Takeaways
- Warehouse: clean, structured, SQL, fast dashboards. Bad for unstructured or raw data.
- Lake: raw, cheap, flexible. Good for ML. Bad for interactive SQL and governance.
- Lakehouse: open table formats add ACID + SQL performance on top of cheap object storage.
- ETL (warehouse) transforms before storing; ELT (lake/lakehouse) stores raw and transforms later.
- Most serious data organizations run a hybrid: warehouse for BI, lake/lakehouse for ML.
Practice Quiz
A company stores pre-cleaned sales data in Snowflake and runs daily revenue dashboards. Which storage paradigm best describes this setup?
- A) Data Lake
- B) Data Lakehouse
- C) Data Warehouse
- D) OLTP database
Correct Answer: C - Snowflake is a cloud data warehouse. It stores structured, pre-transformed data and serves SQL-based BI queries, which matches the warehouse pattern exactly.
A data science team needs to train a PyTorch model on five years of raw JSON clickstream events. The events are messy and schema-less. Which storage approach fits best?
- A) Data Warehouse (BigQuery)
- B) Data Lake (S3 + Parquet)
- C) OLTP relational database
- D) In-memory cache (Redis)
Correct Answer: B - A data lake on object storage (S3) stores raw, schema-less files cheaply and lets ML pipelines read them directly with Spark or PyTorch. A warehouse would reject the unstructured schema.
What specific capability does Apache Iceberg add to raw Parquet files stored on S3?
- A) Faster CPU cores for query execution
- B) ACID transactions, schema evolution, and time travel via a transaction log
- C) Automatic data compression into ZIP archives
- D) Built-in SQL JOIN optimization
Correct Answer: B - Iceberg maintains a transaction log alongside the Parquet files. This log is what enables ACID writes (multiple concurrent writers without corruption), schema evolution, and point-in-time queries (time travel).
Apache Spark + Delta Lake: SQL and ML on the Same Data with PySpark
Apache Spark is the dominant distributed compute engine for big data analytics and ML pipelines; Delta Lake is an open-source storage layer (built by Databricks) that adds ACID transactions, schema enforcement, time travel, and MERGE/upsert support to Parquet files stored on S3, GCS, or HDFS. Together they form the canonical Python lakehouse stack.
The core advantage over a plain data lake: multiple PySpark jobs can write to the same Delta table concurrently without corrupting each other, BI analysts can query it with SQL, and ML engineers can read the same Parquet files - one storage layer serves both use cases.
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp
from delta import configure_spark_with_delta_pip

# 1. Create a Spark session with Delta Lake extensions
builder = (
    SparkSession.builder
    .appName("LakehouseDemo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# 2. ELT pattern - store raw clickstream events as a Delta table (no pre-transformation)
events_df = spark.read.json("s3://data-lake/raw/clickstream/2024/")
(
    events_df
    .withColumn("ingested_at", current_timestamp())
    .write
    .format("delta")
    .mode("append")              # ACID-safe: concurrent writers never corrupt files
    .partitionBy("event_date")   # partition pruning speeds up date-range queries
    .save("s3://data-lake/delta/clickstream")
)

# 3. SQL analytics on the same Delta table - BI dashboard query
spark.read.format("delta").load("s3://data-lake/delta/clickstream") \
    .createOrReplaceTempView("clickstream")
spark.sql("""
    SELECT product_id,
           COUNT(*) AS page_views,
           COUNT(DISTINCT user_id) AS unique_visitors
    FROM clickstream
    WHERE event_date >= '2024-01-01'
    GROUP BY product_id
    ORDER BY page_views DESC
    LIMIT 10
""").show()

# 4. Time travel - query the table as it existed at version 5 (ML debugging / audits)
historical_df = (
    spark.read.format("delta")
    .option("versionAsOf", 5)    # or .option("timestampAsOf", "2024-01-15")
    .load("s3://data-lake/delta/clickstream")
)
print(f"Historical version row count: {historical_df.count()}")

# 5. MERGE (upsert) - update existing records, insert new ones.
# Assumes an `updates` temp view holding the incoming changed rows has been
# registered. This operation is impossible on plain S3 Parquet without Delta Lake.
spark.sql("""
    MERGE INTO delta.`s3://data-lake/delta/clickstream` AS target
    USING updates AS source
    ON target.event_id = source.event_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```
dbt (data build tool) is the transformation companion in production lakehouse pipelines: SQL-based dbt models run on top of the Delta table, applying business-logic transformations with full lineage tracking and incremental materialization - without any Spark job code.
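As an illustration of what such a model might look like, here is a hypothetical dbt incremental model (the source, table, and column names are invented, not from any real pipeline):

```sql
-- Hypothetical dbt model: models/daily_product_views.sql (names are illustrative)
{{ config(materialized='incremental', unique_key='product_day') }}

SELECT
    product_id,
    event_date,
    product_id || '-' || event_date AS product_day,
    COUNT(*) AS page_views
FROM {{ source('lake', 'clickstream') }}
{% if is_incremental() %}
  -- On incremental runs, only process data newer than what is already built.
  WHERE event_date > (SELECT MAX(event_date) FROM {{ this }})
{% endif %}
GROUP BY product_id, event_date
```

On the first run dbt builds the whole table; on subsequent runs the `is_incremental()` block restricts the scan to new partitions and merges rows by `unique_key`, which is exactly the kind of upsert a Delta or Iceberg table layer supports.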
| Layer | Tool | Role |
|---|---|---|
| Compute | Apache Spark (PySpark) | Distributed read/write, ML training |
| Storage format | Delta Lake / Apache Iceberg | ACID + time travel on Parquet |
| Transformation | dbt | SQL models with lineage |
| BI / SQL | Trino, Presto, Databricks SQL | Interactive analyst queries |
For a full deep-dive on Delta Lake MERGE patterns, Apache Iceberg vs. Delta Lake trade-offs, and dbt incremental model design, a dedicated follow-up post is planned.
Written by
Abstract Algorithms
@abstractalgorithms