Apache Spark Guide — RDDs, DataFrames, and PySpark Examples
Apache Spark is a unified, open-source analytics engine for large-scale data processing, providing APIs in Python, SQL, Scala, and Java with in-memory computation for batch and streaming workloads.
What You’ll Learn
By the end of this tutorial, you’ll understand Spark’s core abstractions (RDDs, DataFrames), how lazy evaluation works, transformations vs actions, and how to perform data processing with PySpark including aggregation examples.
Why Spark Matters
Data volumes grow faster than single-machine processing power. Spark distributes computation across clusters, processing terabytes of data in memory. It’s the standard for large-scale ETL, ML training, and stream processing. DodaTech’s Durga Antivirus Pro uses Spark to process threat intelligence feeds across billions of events for pattern detection.
Apache Spark Learning Path
flowchart LR
A[Data Lakes] --> B[Apache Spark]
B --> C{You Are Here}
C --> D[RDDs]
C --> E[DataFrames]
C --> F[Spark SQL]
D --> G[Transformations]
D --> H[Actions]
E --> I[Aggregations]
E --> J[Joins]
What Is Apache Spark?
Think of Spark like a restaurant kitchen brigade. One cook (single machine) can prepare 50 meals an hour. With 10 cooks (Spark cluster), you get 10x the output — but only if the head chef (Spark driver) divides work efficiently, assigns stations (partitions), and coordinates timing (stages).
Spark does this automatically. It splits data into partitions, distributes work across worker nodes, and handles failures transparently.
Spark Architecture
flowchart TB
subgraph "Spark Cluster"
D[Driver Program
SparkContext]
CM[Cluster Manager
YARN / Mesos / Standalone]
D --> CM
CM --> W1[Worker 1
Executor]
CM --> W2[Worker 2
Executor]
CM --> W3[Worker N
Executor]
W1 --- P1[Partition 1]
W1 --- P2[Partition 2]
W2 --- P3[Partition 3]
W3 --- P4[Partition 4]
end
D --- |DataFrame API| USER[User Code]
RDDs (Resilient Distributed Datasets)
RDDs are the foundational abstraction in Spark — immutable, partitioned collections of objects that can be processed in parallel.
# rdd_basics.py
# Working with RDDs in PySpark
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("RDDDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)
# Create RDD from a list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(data, numSlices=4)
print(f"Partitions: {rdd.getNumPartitions()}")
print(f"First element: {rdd.first()}")
# Transformations (lazy — nothing computed yet)
squared = rdd.map(lambda x: x ** 2)
even = rdd.filter(lambda x: x % 2 == 0)
# Actions (trigger computation)
print(f"Count: {squared.count()}")
print(f"Squared: {squared.collect()}")
print(f"Even: {even.collect()}")
print(f"Sum of evens: {even.reduce(lambda a, b: a + b)}")
sc.stop()Expected output:
Partitions: 4
First element: 1
Count: 10
Squared: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
Even: [2, 4, 6, 8, 10]
Sum of evens: 30DataFrames: The Modern API
DataFrames are RDDs with schema — like a table in a relational database. They’re easier to use, automatically optimized by Spark’s Catalyst optimizer, and support SQL queries.
# dataframe_basics.py
# PySpark DataFrame operations
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
from pyspark.sql.functions import col, avg, sum as spark_sum, count, when
spark = SparkSession.builder \
.appName("DataFrameDemo") \
.config("spark.sql.adaptive.enabled", "true") \
.getOrCreate()
# Create DataFrame from data
data = [
("Alice", "Sales", 75000.0, 4),
("Bob", "Engineering", 95000.0, 6),
("Charlie", "Sales", 82000.0, 3),
("Diana", "Engineering", 105000.0, 8),
("Eve", "Marketing", 65000.0, 2),
("Frank", "Marketing", 72000.0, 5),
("Grace", "Engineering", 115000.0, 10),
("Henry", "Sales", 68000.0, 1),
]
schema = StructType([
StructField("name", StringType(), True),
StructField("department", StringType(), True),
StructField("salary", DoubleType(), True),
StructField("years_experience", IntegerType(), True),
])
df = spark.createDataFrame(data, schema)
print("=== Employee Data ===")
df.show()
# Filtering and transformation
print("=== Engineering Team Only ===")
df.filter(col("department") == "Engineering").show()
print("=== High Earners ($90k+) ===")
df.filter(col("salary") >= 90000).show()
# Aggregation
print("=== Average Salary by Department ===")
df.groupBy("department") \
.agg(
avg("salary").alias("avg_salary"),
spark_sum("salary").alias("total_salary"),
count("*").alias("employee_count")
) \
.orderBy("department") \
.show()
# Derived columns
print("=== Salary Bands ===")
df.withColumn("band",
when(col("salary") < 70000, "Junior")
.when(col("salary") < 90000, "Mid")
.when(col("salary") < 110000, "Senior")
.otherwise("Lead")
).show()
# Spark SQL
print("=== Spark SQL: Top Earners by Department ===")
df.createOrReplaceTempView("employees")
spark.sql("""
SELECT department, name, salary,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
FROM employees
""").show()
spark.stop()Expected output:
=== Employee Data ===
+-------+-----------+------+-----------------+
| name| department|salary|years_experience|
+-------+-----------+------+-----------------+
| Alice| Sales|75000| 4|
| Bob|Engineering|95000| 6|
|Charlie| Sales|82000| 3|
| Diana|Engineering|105000| 8|
| Eve| Marketing|65000| 2|
| Frank| Marketing|72000| 5|
| Grace|Engineering|115000| 10|
| Henry| Sales|68000| 1|
+-------+-----------+------+-----------------+
=== Average Salary by Department ===
+-----------+------------------+------------+--------------+
|department| avg_salary|total_salary|employee_count|
+-----------+------------------+------------+--------------+
|Engineering|105000.0| 315000| 3|
| Marketing| 68500.0| 137000| 2|
| Sales| 75000.0| 225000| 3|
+-----------+------------------+------------+--------------+Transformations vs Actions
| Transformations | Actions | |
|---|---|---|
| What | Build a new RDD/DataFrame from existing one | Compute a result or write to storage |
| When | Lazy — recorded but not executed | Trigger actual computation |
| Return | RDD/DataFrame | Value (int, list, dict) or None (write) |
| Example | .map(), .filter(), .groupBy() | .count(), .collect(), .save() |
| Persistence | DAG of operations built | DAG executed and optimized |
# lazy_evaluation_demo.py
# Demonstrating Spark's lazy evaluation
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("LazyDemo").master("local[*]").getOrCreate()
data = list(range(1, 101))
df = spark.createDataFrame([(x,) for x in data], ["value"])
# These transformations build a DAG but don't execute
print("Step 1: Adding transformations (not executed yet)")
transformed = df \
.filter(col("value") % 2 == 0) \
.withColumn("squared", col("value") ** 2) \
.withColumn("cubed", col("value") ** 3)
# The action triggers execution
print("Step 2: Action triggers computation")
result = transformed.collect()
print(f"Result: {len(result)} rows")
# You can see the query plan
print("\n=== Query Plan (Explain) ===")
transformed.explain(True)
spark.stop()Expected output:
Step 1: Adding transformations (not executed yet)
Step 2: Action triggers computation
Result: 50 rows
=== Query Plan (Explain) ===
== Parsed Logical Plan ==
'Project [value#0L, (value#0L * value#0L) AS squared#..., ...]
...Lazy Evaluation and the Spark DAG
Spark builds a Directed Acyclic Graph (DAG) of transformations, then optimizes and executes it when an action is called. This enables:
- Pipeline optimization — Spark can reorder or combine operations
- Predicate pushdown — Filters are pushed closer to data sources
- Minimized shuffling — Spark plans efficient data movement
Common Spark Mistakes
1. Using collect() on Large Datasets
collect() brings all data to the driver. For a 1TB dataset, this crashes the driver. Use .take(N), .show(), or write to files instead.
2. Ignoring Data Skew
When one partition has 90% of the data, that executor does all the work while others sit idle. Salting keys or repartitioning can help.
3. Too Many Small Files
Writing millions of tiny Parquet files kills read performance. Coalesce to a reasonable number of partitions: df.coalesce(10).write.parquet(path).
4. Not Caching Reused DataFrames
If you use the same DataFrame in multiple actions, Spark recomputes it each time. Use df.cache() or df.persist() for frequently accessed data.
5. Using groupByKey Instead of reduceByKey
groupByKey shuffles all data; reduceByKey combines locally first. The latter is much faster for aggregations.
6. Wide Shuffles Without Tuning
Operations like groupBy, join, and distinct cause shuffles. Tune spark.sql.shuffle.partitions (default 200) based on data size.
Practice Questions
1. What is the difference between a transformation and an action in Spark?
Transformations (map, filter) build a DAG and are lazily evaluated. Actions (count, collect) trigger computation and return results to the driver.
2. What is lazy evaluation and why is it beneficial?
Spark delays computation until an action is called. This allows Spark to optimize the execution plan — combining operations, pushing filters to data sources, and minimizing shuffles.
3. How do DataFrames differ from RDDs?
DataFrames have schema (named columns), are optimized by Catalyst, support SQL queries, and use Tungsten for efficient memory management. RDDs are lower-level with no schema optimization.
4. What causes data skew and how can you fix it?
Uneven data distribution across partitions (e.g., “NULL” as a join key). Fix by salting keys (adding random prefixes), increasing partitions, or using broadcast joins for small tables.
5. Challenge: Write a PySpark job that reads 500GB of clickstream data, filters for 2026 events, aggregates by hour and page, and writes the result partitioned by date.
df = spark.read.parquet("s3://datalake/clickstream/")
filtered = df.filter(col("year") == 2026)
hourly = filtered.groupBy("date", "hour", "page").agg(count("*").alias("views"))
hourly.write.partitionBy("date").mode("overwrite").parquet("s3://analytics/hourly_views/")Mini Project: Sales Analytics with PySpark
# sales_analytics_spark.py
# Complete PySpark sales analysis
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, month, year, rank
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("SalesAnalytics").master("local[*]").getOrCreate()
# Sample sales data
sales_data = [
("ORD-001", "2026-01-15", "Electronics", 1200.00, "US"),
("ORD-002", "2026-01-20", "Clothing", 150.00, "US"),
("ORD-003", "2026-02-01", "Electronics", 850.00, "EU"),
("ORD-004", "2026-02-10", "Books", 45.00, "US"),
("ORD-005", "2026-02-15", "Electronics", 2100.00, "APAC"),
("ORD-006", "2026-03-05", "Clothing", 320.00, "EU"),
("ORD-007", "2026-03-12", "Books", 95.00, "US"),
("ORD-008", "2026-03-20", "Electronics", 650.00, "US"),
("ORD-009", "2026-04-01", "Clothing", 180.00, "APAC"),
("ORD-010", "2026-04-10", "Electronics", 3100.00, "EU"),
]
columns = ["order_id", "order_date", "category", "amount", "region"]
sales_df = spark.createDataFrame(sales_data, columns)
# Add derived columns
enriched_df = sales_df \
.withColumn("month", month("order_date")) \
.withColumn("year", year("order_date"))
print("=== Monthly Revenue ===")
enriched_df.groupBy("month") \
.agg(sum("amount").alias("revenue"), count("*").alias("orders")) \
.orderBy("month") \
.show()
print("=== Top Categories by Revenue ===")
window_spec = Window.orderBy(col("revenue").desc())
enriched_df.groupBy("category") \
.agg(sum("amount").alias("revenue")) \
.withColumn("rank", rank().over(window_spec)) \
.show()
print("=== Regional Sales Summary ===")
enriched_df.groupBy("region") \
.agg(
sum("amount").alias("total_sales"),
avg("amount").alias("avg_order_value"),
count("*").alias("order_count")
) \
.orderBy(col("total_sales").desc()) \
.show()
spark.stop()Expected output:
=== Monthly Revenue ===
+-----+--------+------+
|month| revenue|orders|
+-----+--------+------+
| 1| 1350.00| 2|
| 2| 2995.00| 3|
| 3| 1065.00| 3|
| 4| 3280.00| 2|
+-----+--------+------+
=== Top Categories by Revenue ===
+-----------+--------+----+
| category| revenue|rank|
+-----------+--------+----+
|Electronics| 7900.00| 1|
| Clothing| 650.00| 2|
| Books| 140.00| 3|
+-----------+--------+----+
=== Regional Sales Summary ===
+------+-----------+---------------+-----------+
|region|total_sales|avg_order_value|order_count|
+------+-----------+---------------+-----------+
| US| 2095.00| 523.75| 4|
| EU| 4095.00| 1365.00| 3|
| APAC| 2390.00| 1195.00| 2|
+------+-----------+---------------+-----------+Related Concepts
What’s Next
You now understand Spark fundamentals! Next, explore stream processing with Spark Streaming and Kafka, then learn how to build complete data pipelines combining Spark with Airflow and dbt.
- Practice daily — Run PySpark locally on a CSV file with 100k+ rows
- Build a project — Create a Spark job that processes server logs and detects anomalies
- Explore related topics — Check out Spark MLlib for machine learning at scale
Remember: every expert was once a beginner. Keep coding!
Built by the developers of DodaTech
Doda Browser, DodaZIP & Durga Antivirus Pro