What is an RDD?
  • Resilient -- if data is lost, it can be recreated
  • Distributed -- stored on nodes across the cluster
  • Dataset -- the initial data comes from some distributed storage
What is an RDD?
  • A large collection of data with the following properties:
    • Immutable
    • Distributed
    • Lazily evaluated
    • Cacheable
Immutable and Distributed
  • Partitions are a logical division of the data
  • Derived from Hadoop Map/Reduce
  • All input, intermediate, and output data is represented as partitions
  • Partitions are the basic unit of parallelism
  • RDD data is just a collection of partitions
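The bullets above can be sketched in plain Python. This is a toy model (not the real Spark API; `ToyRDD` and its methods are invented for illustration): the RDD is nothing more than a list of partitions, and operations apply per partition, which is what makes partitions the unit of parallelism.

```python
# Toy model (not Spark's API): an RDD is just a collection of partitions,
# and each operation runs independently on every partition.
class ToyRDD:
    def __init__(self, partitions):
        self.partitions = partitions  # a list of lists

    def map(self, f):
        # map is applied partition by partition
        return ToyRDD([[f(x) for x in p] for p in self.partitions])

    def collect(self):
        # flatten the partitions back into one local collection
        return [x for p in self.partitions for x in p]

rdd = ToyRDD([[1, 2], [3, 4], [5, 6]])   # 3 partitions
doubled = rdd.map(lambda x: x * 2)
result = doubled.collect()
```

Because each partition is processed independently, a real scheduler can run the three per-partition `map` calls on three different nodes.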
Partition from hdfs data
Hash partitions
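A minimal sketch of hash partitioning, assuming the usual `hash(key) mod numPartitions` rule (which is the spirit of Spark's `HashPartitioner`; the function name here is invented):

```python
# Toy hash partitioner: a record's key decides its partition via
# hash(key) % num_partitions, so equal keys always land together.
def hash_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[hash(key) % num_partitions].append((key, value))
    return partitions

parts = hash_partition([("a", 1), ("b", 2), ("a", 3)], 4)
```

Co-locating equal keys in one partition is what makes per-key operations like `groupBy` possible without comparing data across every partition.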


  • Each RDD has a reference to its parent RDD
  • The parent of the first RDD is NULL
  • Before computing its own data, an RDD always computes its parent
  • This chain of computation is what enables laziness
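The parent chain can be sketched as follows (a toy lineage model, not Spark internals; `LineageRDD` is an invented name): each RDD records only its parent and a function, and nothing runs until `compute()` walks the chain back to the first RDD, whose parent is `None`.

```python
# Toy lineage chain: building an RDD records the parent and the function,
# but does no work; compute() recursively computes the parent first.
class LineageRDD:
    def __init__(self, parent=None, func=None, data=None):
        self.parent = parent      # None for the first RDD in the chain
        self.func = func
        self.data = data

    def map(self, f):
        # lazily extend the lineage; no evaluation happens here
        return LineageRDD(parent=self, func=f)

    def compute(self):
        if self.parent is None:
            return list(self.data)          # base of the chain
        return [self.func(x) for x in self.parent.compute()]

base = LineageRDD(data=[1, 2, 3])
derived = base.map(lambda x: x + 1).map(lambda x: x * 10)
values = derived.compute()
```

Until `compute()` is called, `derived` is just a description of work, which is also what makes recovery possible: a lost result can be recomputed from the chain.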


  • cache internally uses the persist API
  • persist sets a specific storage level for a given RDD
  • The Spark context keeps track of persisted RDDs
  • When a partition is first evaluated, it is put into memory by the block manager
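The cache/persist relationship above can be sketched like this (invented class names, not Spark's implementation): `cache()` simply delegates to `persist()` with an in-memory storage level, and the context records which RDDs are persisted.

```python
# Toy persist: cache() is just persist() with an in-memory storage level,
# and the context tracks every persisted RDD by id.
MEMORY_ONLY = "MEMORY_ONLY"

class ToyContext:
    def __init__(self):
        self.persistent_rdds = {}

class ToyCachedRDD:
    _next_id = 0

    def __init__(self, ctx, data):
        self.id = ToyCachedRDD._next_id
        ToyCachedRDD._next_id += 1
        self.ctx = ctx
        self.data = data
        self.storage_level = None       # nothing persisted by default

    def persist(self, level):
        self.storage_level = level
        self.ctx.persistent_rdds[self.id] = self   # context tracks it
        return self

    def cache(self):
        return self.persist(MEMORY_ONLY)           # cache == persist(MEMORY_ONLY)

ctx = ToyContext()
rdd = ToyCachedRDD(ctx, [1, 2, 3]).cache()
```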

Block manager

  • Handles all in-memory data in Spark
  • Responsible for:
    • Cached data (BlockRDD)
    • Shuffle data
    • Broadcast data
  • A partition is stored in a block with id (RDD.id, partition_index)
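A minimal sketch of that block naming scheme (a toy in-memory store, not the real BlockManager): cached partitions are addressed by the pair `(rdd_id, partition_index)`.

```python
# Toy block manager: an in-memory store keyed by (rdd_id, partition_index),
# mirroring the block id scheme described above.
class ToyBlockManager:
    def __init__(self):
        self.store = {}

    def put(self, rdd_id, partition_index, data):
        self.store[(rdd_id, partition_index)] = data

    def get(self, rdd_id, partition_index):
        # returns None when the block was never cached
        return self.store.get((rdd_id, partition_index))

bm = ToyBlockManager()
bm.put(rdd_id=5, partition_index=0, data=[1, 2, 3])
```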

How caching works

  • The partition iterator checks the storage level
  • If a storage level is set, it serves the partition from the cache, computing and caching it on first access
  • Since this iterator runs for every RDD partition, caching is transparent to users
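The get-or-compute behavior above can be sketched as a single function (an illustrative sketch, not Spark's iterator code): if a storage level is set, the first access computes and caches the partition, and later accesses hit the cache; the caller never sees the difference.

```python
# Toy get-or-compute: with a storage level set, the partition is computed
# once and then always served from the cache; without one, it is always
# recomputed. The caller's code is identical either way.
def partition_iterator(rdd_id, part_idx, storage_level, cache, compute):
    if storage_level is not None:
        key = (rdd_id, part_idx)
        if key not in cache:
            cache[key] = compute()   # first evaluation fills the cache
        return cache[key]
    return compute()                 # no caching requested

calls = []
def compute():
    calls.append(1)                  # count how often we really compute
    return [1, 2, 3]

cache = {}
first = partition_iterator(0, 0, "MEMORY_ONLY", cache, compute)
second = partition_iterator(0, 0, "MEMORY_ONLY", cache, compute)
```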

RDD operations

  1. Transformations (create a new RDD from an existing one):
    • map
    • groupBy
    • filter
    • union
  2. Actions (trigger computation and return a result):
    • reduce
    • count
    • collect
    • saveAsTextFile
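The lazy/eager split between the two groups can be demonstrated with plain Python's own lazy iterators as an analogy (this is not Spark code): `map` and `filter` only build a pipeline, and nothing runs until an "action" pulls data through it.

```python
# Analogy in plain Python: like Spark transformations, map/filter here are
# lazy; only the final "action" (list) forces evaluation of the pipeline.
evaluated = []

def traced(x):
    evaluated.append(x)   # record that evaluation actually happened
    return x * 2

pipeline = map(traced, [1, 2, 3, 4])            # "transformation": no work yet
pipeline = filter(lambda x: x > 2, pipeline)    # still no work

assert evaluated == []                          # nothing has run so far
result = list(pipeline)                         # "action": forces evaluation
```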

Three important concepts

  1. Task
  2. Stage
  3. Job

Task

One partition, one task.

Stage

Stages are separated by shuffle boundaries.

Job

One action, one job.
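Those three rules give a simple way to count the work in a job (a toy accounting sketch; the function name is invented): one action launches one job, the job's stages are split at shuffles, and each stage runs one task per partition.

```python
# Toy accounting: one action = one job; stages are split at shuffles;
# within a stage, one task per partition.
def job_task_count(stage_partition_counts):
    # each entry is the partition count of one stage of the job
    return sum(stage_partition_counts)

# e.g. an action on a 4-partition map stage followed by a shuffle into a
# 2-partition reduce stage -> one job, two stages
stages = [4, 2]
num_stages = len(stages)
total_tasks = job_task_count(stages)
```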

Job UI

Stage UI

Task UI

Extending the Spark API


  • Domain-specific operators
    • Allow developers to express domain-specific calculations in a cleaner way
  • Domain-specific RDDs
    • A better way of expressing domain data
    • Control over partitioning and distribution
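A minimal sketch of a domain-specific RDD, continuing the toy model rather than Spark's real extension API (the names `SalesRDD` and `total_revenue` are invented): a subclass adds a domain operator on top of the generic partitioned collection.

```python
# Toy domain-specific RDD: subclass a generic partitioned collection and
# add a domain operator, while keeping control over how data is partitioned.
class ToyRDD:
    def __init__(self, partitions):
        self.partitions = partitions

class SalesRDD(ToyRDD):
    def total_revenue(self):
        # domain operator: clearer at the call site than a generic
        # map/reduce chain over (price, quantity) tuples
        return sum(price * qty
                   for p in self.partitions
                   for price, qty in p)

sales = SalesRDD([[(10.0, 2), (5.0, 1)], [(3.0, 4)]])
revenue = sales.total_revenue()
```

The caller writes `sales.total_revenue()` instead of spelling out the map and reduce each time, which is the "cleaner way" the slide refers to.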