Series

Big Data Engineering

A structured series for learning Big Data Engineering through published articles.

Articles: 12
Estimated reading: 3h 20m
Knowledge level: Intermediate to Advanced
Readers: 299

About this series

Learn with real-world examples
Connect articles into a structured path
Best practices and trade-offs
Interview-focused insights
Continuously updated content

Who is this for?

Software engineers and developers learning this topic.

Last Updated

May 29, 2026

Created by

Abstract Algorithms

All Articles

Article 1

Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?

TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg

15 min read

Article 2

Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh

TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how serving layers are materialized, and who owns quality over time. Lambda, Kappa, CD

17 min read

Article 3

Data Pipeline Orchestration Pattern: DAG Scheduling, Retries, and Recovery

TLDR: Pipeline orchestration is an operational control plane problem that requires explicit dependency, retry, and backfill contracts. TLDR: Pipeline orchestration is less about drawing DAGs and mor

14 min read

Article 4

Dimensional Modeling and SCD Patterns: Building Stable Analytics Warehouses

TLDR: Dimensional modeling with explicit SCD policy is the foundation for reproducible metrics and trustworthy historical analytics. TLDR: Dimensional models stay trustworthy only when teams define

15 min read

Article 5

Lambda Architecture Pattern: Balancing Batch Accuracy with Streaming Freshness

TLDR: Lambda architecture is justified when replay correctness and sub-minute freshness are both non-negotiable despite dual-path complexity. TLDR: Lambda architecture is a fit only when you need bo

14 min read

Article 6

Stream Processing Pipeline Pattern: Stateful Real-Time Data Products

TLDR: Stream pipelines succeed when event-time semantics, state management, and replay strategy are designed together, and Kafka Streams lets you build all three directly inside your Spring Boot serv

15 min read

Article 7

Big Data 101: The 5 Vs, Ecosystem, and Why Scale Breaks Everything

TLDR: Traditional databases fail at big data scale for three concrete reasons: storage saturation, compute bottleneck, and write-lock contention. The 5 Vs (Volume, Velocity, Variety, Veracity, Value)

21 min read

Article 8

Kappa Architecture: Streaming-First Data Pipelines

TLDR: Kappa architecture replaces Lambda's batch + speed dual codebases with a single streaming pipeline backed by a replayable Kafka log. Reprocessing becomes replaying from offset 0. One codebase, n

21 min read

Article 9

Medallion Architecture: Bronze, Silver, and Gold Layers in Practice

TLDR: Medallion Architecture solves the "data swamp" problem by organizing a data lake into three progressively refined zones: Bronze (raw, immutable), Silver (cleaned, conformed), Gold (aggregated,

23 min read

Article 10

Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi

TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, an

24 min read

Article 11

Data Governance Essentials: Framework and Best Practices

TLDR: 📋 Data governance is the framework that answers "who owns this data, who can access it, and what quality standards must it meet?" Without governance, data pipelines become chaotic. Implement it

9 min read

Article 12

Data Lineage Explained: Tracking Data Flow Across Your Organization

TLDR: 📊 Data lineage is the complete genealogy of your data: where it comes from, how it's transformed, and where it ends up. It's critical for debugging pipelines, proving compliance, and understan

12 min read

Big Data Engineering: Learning Roadmap

You have terabytes of customer data, but your ETL pipelines keep breaking. Your analytics queries take 8 hours to run when they should take 20 minutes. You've heard about Apache Spark and Kafka, but you don't know where to start or which problems they actually solve.

Here's the central challenge: big data engineering isn't just about learning tools. It's about understanding which problems require which solutions, and in what order to learn them. This roadmap provides a decision-tree approach to mastering big data systems from the ground up.

TLDR: Navigate big data engineering through a structured decision tree: start with fundamentals (5 Vs + storage paradigms), choose your architecture pattern (Lambda/Kappa/Medallion), build pipelines (orchestration + processing), then advance to production concerns (dimensional modeling + modern table formats).

What You'll Learn

Understand Big Data Engineering through real published examples

Follow a sequence of 12 articles from fundamentals to deeper topics

Connect related concepts: architecture, big data, data engineering

Practice explaining trade-offs and implementation decisions

Prerequisites

Basic software engineering knowledge
Comfort reading technical articles

FAQs

How should I read this series?

Start from the first article if you are new, or use the article list to jump into the most relevant topic.

Is progress automatic?

Progress is tracked from the articles you open in this browser, using its local learning history.