Skip to content
Apache Spark Guide — RDDs, DataFrames, and PySpark Examples

Apache Spark Guide — RDDs, DataFrames, and PySpark Examples

DodaTech Updated Jun 15, 2026 8 min read

Apache Spark is a unified, open-source analytics engine for large-scale data processing, providing APIs in Python, SQL, Scala, and Java with in-memory computation for batch and streaming workloads.

What You’ll Learn

By the end of this tutorial, you’ll understand Spark’s core abstractions (RDDs, DataFrames), how lazy evaluation works, transformations vs actions, and how to perform data processing with PySpark including aggregation examples.

Why Spark Matters

Data volumes grow faster than single-machine processing power. Spark distributes computation across clusters, processing terabytes of data in memory. It’s the standard for large-scale ETL, ML training, and stream processing. DodaTech’s Durga Antivirus Pro uses Spark to process threat intelligence feeds across billions of events for pattern detection.

Apache Spark Learning Path


flowchart LR
  A[Data Lakes] --> B[Apache Spark]
  B --> C{You Are Here}
  C --> D[RDDs]
  C --> E[DataFrames]
  C --> F[Spark SQL]
  D --> G[Transformations]
  D --> H[Actions]
  E --> I[Aggregations]
  E --> J[Joins]

Prerequisites: Python fundamentals, SQL basics. Understanding of data lakes and distributed computing concepts helps.

What Is Apache Spark?

Think of Spark like a restaurant kitchen brigade. One cook (single machine) can prepare 50 meals an hour. With 10 cooks (Spark cluster), you get 10x the output — but only if the head chef (Spark driver) divides work efficiently, assigns stations (partitions), and coordinates timing (stages).

Spark does this automatically. It splits data into partitions, distributes work across worker nodes, and handles failures transparently.

Spark Architecture


flowchart TB
  subgraph "Spark Cluster"
    D[Driver Program
SparkContext] CM[Cluster Manager
YARN / Mesos / Standalone] D --> CM CM --> W1[Worker 1
Executor] CM --> W2[Worker 2
Executor] CM --> W3[Worker N
Executor] W1 --- P1[Partition 1] W1 --- P2[Partition 2] W2 --- P3[Partition 3] W3 --- P4[Partition 4] end D --- |DataFrame API| USER[User Code]

RDDs (Resilient Distributed Datasets)

RDDs are the foundational abstraction in Spark — immutable, partitioned collections of objects that can be processed in parallel.

# rdd_basics.py
# Working with RDDs in PySpark
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("RDDDemo").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Create RDD from a list
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rdd = sc.parallelize(data, numSlices=4)
print(f"Partitions: {rdd.getNumPartitions()}")
print(f"First element: {rdd.first()}")

# Transformations (lazy — nothing computed yet)
squared = rdd.map(lambda x: x ** 2)
even = rdd.filter(lambda x: x % 2 == 0)

# Actions (trigger computation)
print(f"Count: {squared.count()}")
print(f"Squared: {squared.collect()}")
print(f"Even: {even.collect()}")
print(f"Sum of evens: {even.reduce(lambda a, b: a + b)}")

sc.stop()

Expected output:

Partitions: 4
First element: 1
Count: 10
Squared: [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
Even: [2, 4, 6, 8, 10]
Sum of evens: 30

DataFrames: The Modern API

DataFrames are RDDs with schema — like a table in a relational database. They’re easier to use, automatically optimized by Spark’s Catalyst optimizer, and support SQL queries.

# dataframe_basics.py
# PySpark DataFrame operations
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType
from pyspark.sql.functions import col, avg, sum as spark_sum, count, when

spark = SparkSession.builder \
    .appName("DataFrameDemo") \
    .config("spark.sql.adaptive.enabled", "true") \
    .getOrCreate()

# Create DataFrame from data
data = [
    ("Alice", "Sales", 75000.0, 4),
    ("Bob", "Engineering", 95000.0, 6),
    ("Charlie", "Sales", 82000.0, 3),
    ("Diana", "Engineering", 105000.0, 8),
    ("Eve", "Marketing", 65000.0, 2),
    ("Frank", "Marketing", 72000.0, 5),
    ("Grace", "Engineering", 115000.0, 10),
    ("Henry", "Sales", 68000.0, 1),
]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("department", StringType(), True),
    StructField("salary", DoubleType(), True),
    StructField("years_experience", IntegerType(), True),
])

df = spark.createDataFrame(data, schema)
print("=== Employee Data ===")
df.show()

# Filtering and transformation
print("=== Engineering Team Only ===")
df.filter(col("department") == "Engineering").show()

print("=== High Earners ($90k+) ===")
df.filter(col("salary") >= 90000).show()

# Aggregation
print("=== Average Salary by Department ===")
df.groupBy("department") \
    .agg(
        avg("salary").alias("avg_salary"),
        spark_sum("salary").alias("total_salary"),
        count("*").alias("employee_count")
    ) \
    .orderBy("department") \
    .show()

# Derived columns
print("=== Salary Bands ===")
df.withColumn("band",
    when(col("salary") < 70000, "Junior")
    .when(col("salary") < 90000, "Mid")
    .when(col("salary") < 110000, "Senior")
    .otherwise("Lead")
).show()

# Spark SQL
print("=== Spark SQL: Top Earners by Department ===")
df.createOrReplaceTempView("employees")
spark.sql("""
    SELECT department, name, salary,
           RANK() OVER (PARTITION BY department ORDER BY salary DESC) as rank
    FROM employees
""").show()

spark.stop()

Expected output:

=== Employee Data ===
+-------+-----------+------+-----------------+
|   name| department|salary|years_experience|
+-------+-----------+------+-----------------+
|  Alice|      Sales|75000|                4|
|    Bob|Engineering|95000|                6|
|Charlie|      Sales|82000|                3|
|  Diana|Engineering|105000|               8|
|    Eve| Marketing|65000|                2|
|  Frank| Marketing|72000|                5|
|  Grace|Engineering|115000|              10|
|  Henry|      Sales|68000|                1|
+-------+-----------+------+-----------------+

=== Average Salary by Department ===
+-----------+------------------+------------+--------------+
|department|       avg_salary|total_salary|employee_count|
+-----------+------------------+------------+--------------+
|Engineering|105000.0|      315000|             3|
| Marketing| 68500.0|      137000|             2|
|      Sales| 75000.0|      225000|             3|
+-----------+------------------+------------+--------------+

Transformations vs Actions

TransformationsActions
WhatBuild a new RDD/DataFrame from existing oneCompute a result or write to storage
WhenLazy — recorded but not executedTrigger actual computation
ReturnRDD/DataFrameValue (int, list, dict) or None (write)
Example.map(), .filter(), .groupBy().count(), .collect(), .save()
PersistenceDAG of operations builtDAG executed and optimized
# lazy_evaluation_demo.py
# Demonstrating Spark's lazy evaluation

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LazyDemo").master("local[*]").getOrCreate()

data = list(range(1, 101))
df = spark.createDataFrame([(x,) for x in data], ["value"])

# These transformations build a DAG but don't execute
print("Step 1: Adding transformations (not executed yet)")
transformed = df \
    .filter(col("value") % 2 == 0) \
    .withColumn("squared", col("value") ** 2) \
    .withColumn("cubed", col("value") ** 3)

# The action triggers execution
print("Step 2: Action triggers computation")
result = transformed.collect()
print(f"Result: {len(result)} rows")

# You can see the query plan
print("\n=== Query Plan (Explain) ===")
transformed.explain(True)

spark.stop()

Expected output:

Step 1: Adding transformations (not executed yet)
Step 2: Action triggers computation
Result: 50 rows

=== Query Plan (Explain) ===
== Parsed Logical Plan ==
'Project [value#0L, (value#0L * value#0L) AS squared#..., ...]
...

Lazy Evaluation and the Spark DAG

Spark builds a Directed Acyclic Graph (DAG) of transformations, then optimizes and executes it when an action is called. This enables:

  • Pipeline optimization — Spark can reorder or combine operations
  • Predicate pushdown — Filters are pushed closer to data sources
  • Minimized shuffling — Spark plans efficient data movement

Common Spark Mistakes

1. Using collect() on Large Datasets

collect() brings all data to the driver. For a 1TB dataset, this crashes the driver. Use .take(N), .show(), or write to files instead.

2. Ignoring Data Skew

When one partition has 90% of the data, that executor does all the work while others sit idle. Salting keys or repartitioning can help.

3. Too Many Small Files

Writing millions of tiny Parquet files kills read performance. Coalesce to a reasonable number of partitions: df.coalesce(10).write.parquet(path).

4. Not Caching Reused DataFrames

If you use the same DataFrame in multiple actions, Spark recomputes it each time. Use df.cache() or df.persist() for frequently accessed data.

5. Using groupByKey Instead of reduceByKey

groupByKey shuffles all data; reduceByKey combines locally first. The latter is much faster for aggregations.

6. Wide Shuffles Without Tuning

Operations like groupBy, join, and distinct cause shuffles. Tune spark.sql.shuffle.partitions (default 200) based on data size.

Practice Questions

1. What is the difference between a transformation and an action in Spark?

Transformations (map, filter) build a DAG and are lazily evaluated. Actions (count, collect) trigger computation and return results to the driver.

2. What is lazy evaluation and why is it beneficial?

Spark delays computation until an action is called. This allows Spark to optimize the execution plan — combining operations, pushing filters to data sources, and minimizing shuffles.

3. How do DataFrames differ from RDDs?

DataFrames have schema (named columns), are optimized by Catalyst, support SQL queries, and use Tungsten for efficient memory management. RDDs are lower-level with no schema optimization.

4. What causes data skew and how can you fix it?

Uneven data distribution across partitions (e.g., “NULL” as a join key). Fix by salting keys (adding random prefixes), increasing partitions, or using broadcast joins for small tables.

5. Challenge: Write a PySpark job that reads 500GB of clickstream data, filters for 2026 events, aggregates by hour and page, and writes the result partitioned by date.

df = spark.read.parquet("s3://datalake/clickstream/")
filtered = df.filter(col("year") == 2026)
hourly = filtered.groupBy("date", "hour", "page").agg(count("*").alias("views"))
hourly.write.partitionBy("date").mode("overwrite").parquet("s3://analytics/hourly_views/")

Mini Project: Sales Analytics with PySpark

# sales_analytics_spark.py
# Complete PySpark sales analysis
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum, avg, count, month, year, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("SalesAnalytics").master("local[*]").getOrCreate()

# Sample sales data
sales_data = [
    ("ORD-001", "2026-01-15", "Electronics", 1200.00, "US"),
    ("ORD-002", "2026-01-20", "Clothing", 150.00, "US"),
    ("ORD-003", "2026-02-01", "Electronics", 850.00, "EU"),
    ("ORD-004", "2026-02-10", "Books", 45.00, "US"),
    ("ORD-005", "2026-02-15", "Electronics", 2100.00, "APAC"),
    ("ORD-006", "2026-03-05", "Clothing", 320.00, "EU"),
    ("ORD-007", "2026-03-12", "Books", 95.00, "US"),
    ("ORD-008", "2026-03-20", "Electronics", 650.00, "US"),
    ("ORD-009", "2026-04-01", "Clothing", 180.00, "APAC"),
    ("ORD-010", "2026-04-10", "Electronics", 3100.00, "EU"),
]

columns = ["order_id", "order_date", "category", "amount", "region"]
sales_df = spark.createDataFrame(sales_data, columns)

# Add derived columns
enriched_df = sales_df \
    .withColumn("month", month("order_date")) \
    .withColumn("year", year("order_date"))

print("=== Monthly Revenue ===")
enriched_df.groupBy("month") \
    .agg(sum("amount").alias("revenue"), count("*").alias("orders")) \
    .orderBy("month") \
    .show()

print("=== Top Categories by Revenue ===")
window_spec = Window.orderBy(col("revenue").desc())
enriched_df.groupBy("category") \
    .agg(sum("amount").alias("revenue")) \
    .withColumn("rank", rank().over(window_spec)) \
    .show()

print("=== Regional Sales Summary ===")
enriched_df.groupBy("region") \
    .agg(
        sum("amount").alias("total_sales"),
        avg("amount").alias("avg_order_value"),
        count("*").alias("order_count")
    ) \
    .orderBy(col("total_sales").desc()) \
    .show()

spark.stop()

Expected output:

=== Monthly Revenue ===
+-----+--------+------+
|month| revenue|orders|
+-----+--------+------+
|    1| 1350.00|    2|
|    2| 2995.00|    3|
|    3| 1065.00|    3|
|    4| 3280.00|    2|
+-----+--------+------+

=== Top Categories by Revenue ===
+-----------+--------+----+
|   category| revenue|rank|
+-----------+--------+----+
|Electronics| 7900.00|   1|
|   Clothing|  650.00|   2|
|      Books|  140.00|   3|
+-----------+--------+----+

=== Regional Sales Summary ===
+------+-----------+---------------+-----------+
|region|total_sales|avg_order_value|order_count|
+------+-----------+---------------+-----------+
|    US|    2095.00|        523.75|          4|
|    EU|    4095.00|       1365.00|          3|
|  APAC|    2390.00|       1195.00|          2|
+------+-----------+---------------+-----------+

Related Concepts

What’s Next

You now understand Spark fundamentals! Next, explore stream processing with Spark Streaming and Kafka, then learn how to build complete data pipelines combining Spark with Airflow and dbt.

  • Practice daily — Run PySpark locally on a CSV file with 100k+ rows
  • Build a project — Create a Spark job that processes server logs and detects anomalies
  • Explore related topics — Check out Spark MLlib for machine learning at scale

Remember: every expert was once a beginner. Keep coding!

Built by the developers of DodaTech

Doda Browser, DodaZIP & Durga Antivirus Pro