Understanding Spark Streaming

When a StreamingContext starts, it starts a JobScheduler, which in turn:

  • Starts the ReceiverTracker: receivers pull data from the source and use the BlockManager to store it in the MemoryStore
  • Starts the JobGenerator: at every batch interval it creates Jobs from the DStream graph for the given timestamp and the registered output streams
  • Sets up other resources, such as job listeners

Each of these schedulers runs inside an event loop (a sketch of the pattern follows).
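
Spark drives each of these with an internal event-loop abstraction: a daemon thread draining a queue of events and dispatching them to a handler. The sketch below only illustrates the pattern; it is not Spark's EventLoop class, and the names are simplified assumptions.

import java.util.concurrent.LinkedBlockingQueue

// Minimal sketch of the event-loop pattern behind JobScheduler, JobGenerator and DAGScheduler.
// Not Spark's internal EventLoop class; names and structure are simplified.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  private val thread = new Thread(name) {
    override def run(): Unit =
      try { while (true) onReceive(queue.take()) }   // block until an event arrives, then handle it
      catch { case _: InterruptedException => () }   // stop() interrupts the thread to end the loop
  }
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def stop(): Unit = thread.interrupt()
  def post(event: E): Unit = queue.put(event)        // schedulers post events such as GenerateJobs(time)
  protected def onReceive(event: E): Unit            // subclasses decide how to handle each event
}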

A streaming application usually contains:

  • An input DStream: for example SocketInputDStream, a Kafka input stream, or FileInputDStream
  • Transformations: operations such as map or filter that produce a new DStream
  • Output operations: actions such as print or foreachRDD; these trigger the stream to be materialized

The StreamingContext holds a DStreamGraph on which the input streams and output streams are registered; a minimal application is sketched below.
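
For concreteness, a minimal word-count application using only the public API; the socket host/port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object MinimalStreamingApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MinimalStreamingApp").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(1))       // batchDuration = 1 second

    val lines  = ssc.socketTextStream("localhost", 9999)    // input DStream, registered on ssc.graph
    val counts = lines.flatMap(_.split(" "))                 // transformations build the DStream lineage
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                           // output operation: registers an output stream

    ssc.start()                                              // starts JobScheduler, ReceiverTracker, JobGenerator
    ssc.awaitTermination()
  }
}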

How is data read back from the MemoryStore?

  • Every DStream has methods to generateJob, getOrCompute, and finally compute, which for an input stream loads the stored blocks back from memory (see the sketch below)
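
A heavily simplified, self-contained stand-in for that load path (these are not Spark's classes; in Spark the blocks come back as a BlockRDD built from block ids tracked by the ReceiverTracker):

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Stand-in sketch: the receiver records blocks per batch interval;
// compute(batchTime) turns that interval's blocks into a single RDD.
class ReceiverInputSketch(sc: SparkContext, blocksByBatch: Map[Long, Seq[Seq[String]]]) {
  def compute(batchTime: Long): RDD[String] = {
    val blocks = blocksByBatch.getOrElse(batchTime, Seq.empty)  // blocks stored for this batch
    sc.parallelize(blocks.flatten)                              // stands in for a BlockRDD over the stored blocks
  }
}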

A simple class/entity relationship:

StreamingContext (ssc):

  • sparkContext: jobs run by calling its runJob method, which hands the work to the DAGScheduler
  • graph: the DStreamGraph, a list of DStreams holding the registered input streams and output streams
  • JobScheduler: See below

JobScheduler: eventLoop

  • jobGenerator: generates jobs from ssc.graph every batchDuration, then submits them to the jobExecutor
  • jobExecutor: a thread pool that executes the submitted jobs; each job ultimately calls sparkContext.runJob (see the sketch after this list)
  • receiverTracker: manages the receivers and tracks the blocks they store via the BlockManager
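
A simplified sketch of that submit/execute handoff (JobSketch and the class names here are made-up stand-ins, not Spark's Job/JobScheduler; the pool size mirrors spark.streaming.concurrentJobs, which defaults to 1):

import java.util.concurrent.Executors

// A Job here just wraps the function that will eventually call sparkContext.runJob.
case class JobSketch(time: Long, run: () => Unit)

class JobSchedulerSketch(numConcurrentJobs: Int = 1) {
  private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)

  def submitJobSet(jobs: Seq[JobSketch]): Unit =
    jobs.foreach { job =>
      jobExecutor.submit(new Runnable {
        override def run(): Unit = job.run()   // runs this batch's output operation on a pool thread
      })
    }
}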

JobGenerator(jobScheduler): eventLoop

  • RecurringTimer: fires every batchDuration and posts a GenerateJobs event (see the timer sketch below)
  • generateJobs: generates a job for each output stream in ssc.graph
  • submits the generated jobs to the JobScheduler
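
A minimal sketch of that timer loop, using java.util.Timer as a stand-in for Spark's RecurringTimer:

import java.util.{Timer, TimerTask}

object JobGeneratorSketch {
  // Fire generateJobs(time) once every batchDuration, like JobGenerator does on each tick.
  def start(batchDurationMs: Long)(generateJobs: Long => Unit): Timer = {
    val timer = new Timer("job-generator-sketch", true)                    // daemon timer thread
    timer.scheduleAtFixedRate(new TimerTask {
      override def run(): Unit = generateJobs(System.currentTimeMillis())  // one batch per tick
    }, batchDurationMs, batchDurationMs)
    timer
  }
}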

DStream(ssc):

  • foreachRDD: registers the resulting stream as an output stream
  • register: adds the stream to the graph's output streams, which generateJobs() iterates over
  • generateJob: calls compute to get the batch's RDD and wraps it in a Job object
  • getOrCompute/compute: obtain the RDD for a batch, either from the parent DStreams or from memory if it is an InputDStream
  • generatedRDDs: a per-DStream cache of batch time -> already generated RDD (see the sketch below)
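
A simplified memoization sketch of getOrCompute plus generatedRDDs (not Spark's code; computeForTime stands in for the real compute method):

import scala.collection.mutable
import org.apache.spark.rdd.RDD

// One DStream caches the RDD it generated for each batch time.
class DStreamSketch[T](computeForTime: Long => Option[RDD[T]]) {
  private val generatedRDDs = mutable.HashMap[Long, RDD[T]]()

  def getOrCompute(time: Long): Option[RDD[T]] = generatedRDDs.get(time) match {
    case Some(rdd) => Some(rdd)                   // already generated for this batch: reuse it
    case None =>
      val rdd = computeForTime(time)              // compute(): from parent DStreams, or from stored blocks
      rdd.foreach(r => generatedRDDs(time) = r)   // cache it so later lookups (e.g. window ops) reuse it
      rdd
  }
}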

SparkContext

  • DAGScheduler

DAGScheduler: eventLoop

  • Splits an RDD's lineage into stages based on its dependencies; each shuffle dependency introduces a stage boundary (example below)
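
For example, a job with one shuffle is split into two stages: narrow dependencies (flatMap, map) stay in the same stage, while reduceByKey introduces a ShuffleDependency and therefore a stage boundary. Assumes sc is an existing SparkContext; the input path is a placeholder.

val counts = sc.textFile("input.txt")       // placeholder path
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)                       // shuffle boundary -> new stage
counts.count()                              // action: submits the job to the DAGScheduler via sc.runJob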

And finally, how a small pipeline builds the DStream lineage and registers an output stream:

import org.apache.spark.streaming.dstream.DStream

// assumes ssc is the StreamingContext defined above; host and port are placeholders
val source: DStream[String] = ssc.socketTextStream("localhost", 9999)               // input DStream
val transformed: DStream[String] = source.transform(rdd => rdd.filter(_.nonEmpty))  // transformed keeps a dependency on source
transformed.foreachRDD(rdd => rdd.foreach(println))                                 // output operation: registers an output stream