Series
Apache Spark Engineering
A structured series for learning Apache Spark Engineering through published articles.
Articles: 15
Estimated reading: 7h 9m
Knowledge level: Intermediate to Advanced
Readers: 381
Who is this for?
Software engineers and developers learning this topic.
Last Updated: May 25, 2026
Created by: Abstract Algorithms
All Articles
Article 1
Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming
TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabyt
19 min read
Article 2
Spark Adaptive Query Execution: Dynamic Coalescing, Pruning, and Skew Handling
TLDR: Before AQE, Spark compiled your entire query into a static physical plan using size estimates that were frequently wrong — and a wrong estimate at planning time meant a skewed join, 800 small ta
39 min read
Article 3
Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained
TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches ta
28 min read
Article 4
Broadcast Joins vs Sort-Merge Joins in Spark
📖 The 45-Minute Join Stage That Became 90 Seconds A data engineering team at a retail company was running a nightly Spark job that joined their 500 GB transaction fact table against a 50 MB product d
26 min read
Article 5
Caching and Persistence in Spark: Storage Levels and When to Use Them
TLDR: Calling cache() or persist() does not immediately store anything — Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up,
24 min read
Article 6
Spark DataFrames and Spark SQL: Schema, DDL, and the Catalyst Optimizer
TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a
24 min read
Article 7
Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy
TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM — they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver
37 min read
Article 8
Kafka and Spark Structured Streaming: Building a Production Pipeline
📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea
23 min read
Article 9
Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring
TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically-scaled execution environment. The kubeflow Spark Operator manages SparkApplication CRDs throug
36 min read
Article 10
Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies
TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo —
26 min read
Article 11
Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC
TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics a
34 min read
Article 12
Shuffles in Spark: Why groupBy Kills Performance
TLDR: A Spark shuffle is the most expensive operation in any distributed job — it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization
31 min read
Article 13
Stateful Aggregations in Spark Structured Streaming: mapGroupsWithState
TLDR: mapGroupsWithState gives each streaming key its own mutable state object, persisted in a fault-tolerant state store that checkpoints to object storage on every micro-batch. Where window aggregat
28 min read
Article 14
Spark Structured Streaming: Micro-Batch vs Continuous Processing
📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth
27 min read
Article 15
Watermarking and Late Data Handling in Spark Structured Streaming
TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi
27 min read
Apache Spark Engineering: Learning Roadmap
Your Spark jobs are slow, failing with OOM errors, or taking 10x longer than expected. You copy configurations from Stack Overflow, tweak executor memory, and nothing helps. You know Spark is powerful — but you're fighting it rather than using it.
Here's the challenge: Spark's surface API hides enormous internal complexity. A groupBy().agg() that looks simple can trigger a full shuffle of terabytes. This roadmap gives you a mental model of what Spark does under the hood — so you write code that works with the engine, not against it.
TLDR: Master Apache Spark from the ground up: understand the execution model (RDDs, DAGs, shuffle), learn DataFrames and Spark SQL, tune performance with partitioning and caching, implement Structured Streaming, and deploy production Spark jobs with confidence.
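To make the groupBy().agg() point concrete, here is a minimal pure-Python sketch of what a hash-partitioned shuffle does conceptually. It is not Spark code and not how Spark implements shuffles: the names `stable_partition` and `shuffle_and_sum` and the sample sales records are hypothetical, and the sketch deliberately skips the partial, map-side aggregation that Spark applies to many aggregations before shuffling. The idea it illustrates is only that records with the same key start out scattered across input partitions and must all be routed to one place before they can be combined.

```python
import zlib
from collections import defaultdict


def stable_partition(key: str, num_partitions: int) -> int:
    """Pick a target partition from a deterministic hash of the key (hypothetical helper)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_and_sum(input_partitions, num_partitions):
    """Route every (key, value) record to its target partition, then sum per key."""
    shuffled = [defaultdict(list) for _ in range(num_partitions)]
    records_moved = 0
    for partition in input_partitions:
        for key, value in partition:
            # In a real cluster this step is the network transfer.
            shuffled[stable_partition(key, num_partitions)][key].append(value)
            records_moved += 1
    # Aggregation can only start once all values for a key are co-located.
    totals = [{k: sum(vs) for k, vs in part.items()} for part in shuffled]
    return totals, records_moved


# Illustrative data: the same keys are spread across three input partitions.
input_partitions = [
    [("shoes", 10), ("hats", 5), ("shoes", 7)],
    [("hats", 3), ("bags", 20)],
    [("shoes", 1), ("bags", 4), ("hats", 2)],
]

per_partition_totals, records_moved = shuffle_and_sum(input_partitions, num_partitions=2)
merged = {k: v for part in per_partition_totals for k, v in part.items()}
print(merged)
print(records_moved)
```

In this simplified model every one of the eight input records is routed before any sum is produced, which is why a one-line aggregation can dominate a job's runtime once the input reaches terabytes.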
What You'll Learn
Understand Apache Spark Engineering through real published examples
Follow a sequence of 15 articles from fundamentals to deeper topics
Connect related concepts: big data, Apache Spark, and data engineering
Practice explaining trade-offs and implementation decisions
Prerequisites
FAQs
How should I read this series?
Start from the first article if you are new, or use the article list to jump into the most relevant topic.
Is progress automatic?
Yes. Progress is tracked from the articles you open in this browser, using its local learning history.