What is RDD
  • Resilient -- if data is lost, data can be recreated
  • Distributed -- stored in nodes among the cluster
  • Dataset -- initial data comes from some distributed storage
What is RDD
  • A big data collection of data with following properties:
    • Immutable
    • Distributed
    • Lazily evaluated
    • Cacheable
Immutable and Distributed
Partitions
  • Logical division of data
  • Derived from Hadoop Map/Reduce
  • All input, intermediate and output data will be represented as partitions
  • Partitions are basic unit of parallelism
  • RDD data is just collection of partitions
Partition from hdfs data
Hash partitions

Laziness

  • Each RDD have access to it's parent RDD
  • NULL is the value of parent for first RDD
  • Before computing it's value, it always computes it's parent
  • This chain of running allows for laziness

Caching

  • cache internally uses persist API
  • persist sets a specific storage level for a given RDD
  • Spark context tracks persistent RDD
  • When first evaluates, partition will be put into memory by block manager

Block manager

  • Handles all in memory data in spark
  • Response for:
    • Cached Data (BlockRDD)
    • Shuffle Data
    • Broadcast Data
  • partition will be stored in Block with id (RDD.id, partition_index)

How caching works

  • Partition iterator checks the storage level
  • if Storage level set it calls
    
    cacheManager.getOrCompute(partition)
    
  • As iterator is run for each RDD partition, it's transparent to users

How to create RDD

Operations


  1. Transformations:
    • map
    • group-by
    • filter
    • union
  2. Actions:
    • reduce
    • count
    • collect
    • saveAsTextFile

Three important concepts


  1. Task
  2. Stage
  3. Job

Task

One partition, One Task.

Shuffle

Stage

Separated by shuffle

Job

One Action, One Job.

Job UI

Stage UI

Stage UI

Task UI

Extending spark API

Why

  • Domain Specific Operators
    • Allows developer to express domain specific calculation in cleaner way
  • Domain Specific RDD's
    • Better way of expressing domain data
    • Control over partitioning and distribution