Series
Big Data Engineering
A structured series for learning Big Data Engineering through published articles.
Articles: 12
Estimated reading: 3h 20m
Knowledge level: Intermediate to Advanced
Readers: 299
Who is this for?
Software engineers and developers learning this topic.
Last Updated
May 29, 2026
Created by
Abstract Algorithms
All Articles
Article 1
Data Warehouse vs Data Lake vs Data Lakehouse: Which One to Choose?
TLDR: Warehouse = structured, clean data for BI and SQL dashboards (Snowflake, BigQuery). Lake = raw, messy data for ML and data science (S3, HDFS). Lakehouse = open table formats (Delta Lake, Iceberg
15 min read
Article 2
Big Data Architecture Patterns: Lambda, Kappa, CDC, Medallion, and Data Mesh
TLDR: A serious data platform is defined less by where files are stored and more by how changes enter the system, how serving layers are materialized, and who owns quality over time. Lambda, Kappa, CD
17 min read
Article 3
Data Pipeline Orchestration Pattern: DAG Scheduling, Retries, and Recovery
TLDR: Pipeline orchestration is an operational control plane problem that requires explicit dependency, retry, and backfill contracts. Pipeline orchestration is less about drawing DAGs and mor
14 min read
Article 4
Dimensional Modeling and SCD Patterns: Building Stable Analytics Warehouses
TLDR: Dimensional modeling with an explicit SCD policy is the foundation for reproducible metrics and trustworthy historical analytics. Dimensional models stay trustworthy only when teams define
15 min read
Article 5
Lambda Architecture Pattern: Balancing Batch Accuracy with Streaming Freshness
TLDR: Lambda architecture is justified when replay correctness and sub-minute freshness are both non-negotiable despite dual-path complexity. Lambda architecture is a fit only when you need bo
14 min read

Article 6
Stream Processing Pipeline Pattern: Stateful Real-Time Data Products
TLDR: Stream pipelines succeed when event-time semantics, state management, and replay strategy are designed together, and Kafka Streams lets you build all three directly inside your Spring Boot serv
15 min read

Article 7
Big Data 101: The 5 Vs, Ecosystem, and Why Scale Breaks Everything
TLDR: Traditional databases fail at big data scale for three concrete reasons: storage saturation, compute bottleneck, and write-lock contention. The 5 Vs (Volume, Velocity, Variety, Veracity, Value)
21 min read

Article 8
Kappa Architecture: Streaming-First Data Pipelines
TLDR: Kappa architecture replaces Lambda's batch + speed dual codebases with a single streaming pipeline backed by a replayable Kafka log. Reprocessing becomes replaying from offset 0. One codebase, n
21 min read

Article 9
Medallion Architecture: Bronze, Silver, and Gold Layers in Practice
TLDR: Medallion Architecture solves the "data swamp" problem by organizing a data lake into three progressively refined zones: Bronze (raw, immutable), Silver (cleaned, conformed), Gold (aggregated,
23 min read

Article 10
Modern Table Formats: Delta Lake vs Apache Iceberg vs Apache Hudi
TLDR: Delta Lake, Apache Iceberg, and Apache Hudi are open table formats that wrap Parquet files with a transaction log (or snapshot tree) to deliver ACID guarantees, time travel, schema evolution, an
24 min read
Article 11
Data Governance Essentials: Framework and Best Practices
TLDR: Data governance is the framework that answers "who owns this data, who can access it, and what quality standards must it meet?" Without governance, data pipelines become chaotic. Implement it
9 min read
Article 12
Data Lineage Explained: Tracking Data Flow Across Your Organization
TLDR: Data lineage is the complete genealogy of your data: where it comes from, how it's transformed, and where it ends up. It's critical for debugging pipelines, proving compliance, and understan
12 min read
Big Data Engineering: Learning Roadmap
You have terabytes of customer data, but your ETL pipelines keep breaking. Your analytics queries take 8 hours to do what should be 20 minutes of work. You've heard about Apache Spark and Kafka but don't know where to start or which problems they actually solve.
Here's the central challenge: big data engineering isn't just about learning tools; it's about understanding which problems require which solutions, and in what order to learn them. This roadmap provides a decision-tree approach to mastering big data systems from the ground up.
TLDR: Navigate big data engineering through a structured decision tree: start with fundamentals (5 Vs + storage paradigms), choose your architecture pattern (Lambda/Kappa/Medallion), build pipelines (orchestration + processing), then advance to production concerns (dimensional modeling + modern table formats).
What You'll Learn
Understand Big Data Engineering through real published examples
Follow a sequence of 12 articles from fundamentals to deeper topics
Connect related concepts: architecture, big data, and data engineering
Practice explaining trade-offs and implementation decisions
FAQs
How should I read this series?
Start from the first article if you are new, or use the article list to jump into the most relevant topic.
Is progress automatic?
Progress is based on articles opened from this browser using the local learning history.