Apache Spark Engineering

A structured series for learning Apache Spark Engineering through published articles.

Articles: 15

Estimated reading: 7h 9m

Knowledge level: Intermediate to Advanced

Readers: 381

About this series

Learn with real-world examples

Connect articles into a structured path

Best practices and trade-offs

Interview-focused insights

Continuously updated content

Who is this for?

Software engineers and developers learning this topic.

Knowledge Level

Intermediate to Advanced

Last Updated

May 25, 2026

Created by

Abstract Algorithms

All Articles

Article 1

Apache Spark for Data Engineers: RDDs, DataFrames, and Structured Streaming

TLDR: Apache Spark distributes Python DataFrame jobs across a cluster of executors, using lazy evaluation and the Catalyst query optimizer to process terabytes with the same code that works on gigabyt

19 min read

Article 2

Spark Adaptive Query Execution: Dynamic Coalescing, Pruning, and Skew Handling

TLDR: Before AQE, Spark compiled your entire query into a static physical plan using size estimates that were frequently wrong — and a wrong estimate at planning time meant a skewed join, 800 small ta

39 min read

Article 3

Spark Architecture: Driver, Executors, DAG Scheduler, and Task Scheduler Explained

TLDR: Spark's architecture is a precise chain of responsibility. The Driver converts user code into a DAG, the DAGScheduler breaks it into stages at shuffle boundaries, the TaskScheduler dispatches ta

28 min read

Article 4

Broadcast Joins vs Sort-Merge Joins in Spark

📖 The 45-Minute Join Stage That Became 90 Seconds

A data engineering team at a retail company was running a nightly Spark job that joined their 500 GB transaction fact table against a 50 MB product d

26 min read

Article 5

Caching and Persistence in Spark: Storage Levels and When to Use Them

TLDR: Calling cache() or persist() does not immediately store anything — Spark caches lazily at the first action, partition by partition, managed by a per-executor BlockManager. When memory fills up,

24 min read

Article 6

Spark DataFrames and Spark SQL: Schema, DDL, and the Catalyst Optimizer

TLDR: Catalyst is Spark's query compiler. It takes any DataFrame operation or SQL string, parses it into an abstract syntax tree, resolves column references against the catalog, applies a library of a

24 min read

Article 7

Spark Executor Sizing: Memory Model, Core Tuning, and GC Strategy

TLDR: Spark executor OOMs are almost never caused by insufficient total cluster RAM — they are caused by misallocating memory across five distinct JVM regions while ignoring GC behavior and memoryOver

37 min read

Article 8

Kafka and Spark Structured Streaming: Building a Production Pipeline

📖 The 500K-Event Problem: When a Naive Kafka Consumer Falls Apart

An analytics platform at a mid-sized fintech company needs to process 500,000 payment events per second from a Kafka cluster. The tea

23 min read

Article 9

Spark on Kubernetes: Operator, Dynamic Allocation, and Production Monitoring

TLDR: Running Spark on Kubernetes replaces YARN's static queue model with a container-native, elastically-scaled execution environment. The kubeflow Spark Operator manages SparkApplication CRDs throug

36 min read

Article 10

Partitioning in Spark: HashPartitioner, RangePartitioner, and Custom Strategies

TLDR: Spark's partition count and partitioning strategy are the two levers that determine whether a job scales linearly or crumbles under data growth. HashPartitioner distributes keys by hash modulo —

26 min read

Article 11

Reading and Writing Data in Spark: Parquet, Delta, JSON, and JDBC

TLDR: Parquet's columnar layout with row-group statistics enables predicate pushdown that can reduce a 500 GB scan to 8 GB. Delta Lake wraps Parquet with a JSON transaction log to add ACID semantics a

34 min read

Article 12

Shuffles in Spark: Why groupBy Kills Performance

TLDR: A Spark shuffle is the most expensive operation in any distributed job — it moves every matching key across the network, writes temporary sorted files to disk, and forces a hard synchronization

31 min read

Article 13

Stateful Aggregations in Spark Structured Streaming: mapGroupsWithState

TLDR: mapGroupsWithState gives each streaming key its own mutable state object, persisted in a fault-tolerant state store that checkpoints to object storage on every micro-batch. Where window aggregat

28 min read

Article 14

Spark Structured Streaming: Micro-Batch vs Continuous Processing

📖 The 15-Minute Gap: How a Fraud Team Discovered They Needed Real-Time Streaming

A fintech team runs payment fraud detection with a well-tuned Spark batch job. Every 15 minutes it reads a day's worth

27 min read

Article 15

Watermarking and Late Data Handling in Spark Structured Streaming

TLDR: A watermark tells Spark Structured Streaming: "I will accept events up to N minutes late, and then I am done waiting." Spark tracks the maximum event time seen per partition, takes the global mi

27 min read

Apache Spark Engineering: Learning Roadmap

Your Spark jobs are slow, failing with OOM errors, or taking 10x longer than expected. You copy configurations from Stack Overflow, tweak executor memory, and nothing helps. You know Spark is powerful — but you're fighting it rather than using it.

Here's the challenge: Spark's surface API hides enormous internal complexity. A groupBy().agg() that looks simple can trigger a full shuffle of terabytes. This roadmap gives you a mental model of what Spark does under the hood — so you write code that works with the engine, not against it.
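To make the shuffle point concrete, here is a minimal sketch in plain Python (not PySpark) of why a grouped aggregation forces data movement: every record with the same key has to end up in the same partition before it can be summed. The function names, the sample records, and the CRC32-based partitioner are all hypothetical choices for illustration. This is a toy model, not Spark's implementation, and it ignores optimizations such as partial aggregation before the shuffle.

```python
import zlib
from collections import defaultdict


def stable_partition(key: str, num_partitions: int) -> int:
    """Pick a target partition from a stable hash of the key (toy partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def group_by_sum(input_partitions, num_partitions):
    """Toy groupBy-sum: shuffle records by key, then aggregate per partition."""
    # Shuffle step: every record is routed to the partition that owns its key.
    shuffled = [[] for _ in range(num_partitions)]
    for partition in input_partitions:
        for key, value in partition:
            shuffled[stable_partition(key, num_partitions)].append((key, value))

    # Reduce step: each partition now holds all records for its keys,
    # so it can sum them without talking to any other partition.
    totals = {}
    for partition in shuffled:
        local = defaultdict(int)
        for key, value in partition:
            local[key] += value
        totals.update(local)

    # In this toy model every input record passes through the shuffle.
    shuffled_count = sum(len(partition) for partition in shuffled)
    return totals, shuffled_count


# Hypothetical sample data: three input partitions of (region, amount) records.
input_partitions = [
    [("us", 10), ("eu", 5)],
    [("us", 7), ("apac", 3)],
    [("eu", 1), ("us", 2)],
]

totals, shuffled_count = group_by_sum(input_partitions, num_partitions=2)
print(totals)
print(shuffled_count)
```

In this model the number of records routed through the shuffle grows with the input size, which is the intuition behind a one-line aggregation becoming expensive at terabyte scale.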

TLDR: Master Apache Spark from the ground up: understand the execution model (RDDs, DAGs, shuffle), learn DataFrames and Spark SQL, tune performance with partitioning and caching, implement Structured Streaming, and deploy production Spark jobs with confidence.

What You'll Learn

Understand Apache Spark Engineering through real published examples

Follow a sequence of 15 articles from fundamentals to deeper topics

Connect related concepts: big data, Apache Spark, and data engineering

Practice explaining trade-offs and implementation decisions

Prerequisites

Basic software engineering knowledge

Comfort reading technical articles

FAQs

How should I read this series?

Start from the first article if you are new, or use the article list to jump into the most relevant topic.

Is progress automatic?

Yes. Progress is tracked from the articles you open in this browser, using its local learning history.