Understanding Spark Streaming

When a StreamingContext starts, it starts a JobScheduler, which in turn does the following:

Start ReceiverTracker: launches the receivers, which receive data from the source and use the BlockManager to save it to the MemoryStore
Start JobGenerator: on every batch interval, creates jobs for the batch timestamp from the output streams registered in the graph
Other resources, such as the job listener

Each scheduler runs its own event loop (eventLoop).
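
To make the event loop idea concrete, here is a minimal sketch: one daemon thread takes events off a blocking queue and hands them to a handler. SimpleEventLoop, SchedulerEvent, JobStarted, and JobCompleted are hypothetical names used only for illustration; this is not Spark's own EventLoop class.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object EventLoopSketch {
  // Hypothetical, simplified event loop: one daemon thread drains a blocking queue.
  abstract class SimpleEventLoop[E](name: String) {
    private val queue = new LinkedBlockingQueue[E]()
    @volatile private var stopped = false
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit =
        try {
          while (!stopped) onReceive(queue.take())
        } catch {
          case _: InterruptedException => () // stop() interrupts a blocked take()
        }
    }

    def start(): Unit = thread.start()
    def stop(): Unit = { stopped = true; thread.interrupt(); thread.join() }
    def post(event: E): Unit = queue.put(event)

    // Called on the loop thread, one event at a time.
    protected def onReceive(event: E): Unit
  }

  // Illustrative events only.
  sealed trait SchedulerEvent
  case class JobStarted(id: Int) extends SchedulerEvent
  case class JobCompleted(id: Int) extends SchedulerEvent

  def demo(): List[String] = {
    val log = ArrayBuffer.empty[String]
    val done = new CountDownLatch(2)
    val loop = new SimpleEventLoop[SchedulerEvent]("scheduler-loop") {
      protected def onReceive(event: SchedulerEvent): Unit = {
        event match {
          case JobStarted(id)   => log += s"started $id"
          case JobCompleted(id) => log += s"completed $id"
        }
        done.countDown()
      }
    }
    loop.start()
    loop.post(JobStarted(1))
    loop.post(JobCompleted(1))
    done.await(5, TimeUnit.SECONDS) // wait until both events were handled
    loop.stop()
    log.toList
  }
}
```

Because a single thread handles every event, the handler needs no locking of its own state, which is probably why each scheduler gets its own loop.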

For a streaming application, it usually contains:

An input stream (InputDStream): examples are SocketInputDStream, KafkaInputDStream, and FileInputDStream
Transformations: functions that produce another stream from an existing one, such as filter and map
An output stream: the output actions of the streaming application, which trigger the stream to be materialized

The StreamingContext has a graph property (a DStreamGraph) on which the input streams and output streams are registered.
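
The registration can be pictured with a toy model. ToyGraph, ToyStream, inputStream, and output are hypothetical stand-ins, not the Spark classes; the sketch only shows that creating an input registers it, that a transformation registers nothing and just remembers its parent, and that an output action registers an output stream.

```scala
import scala.collection.mutable.ArrayBuffer

object GraphSketch {
  // Toy stand-in for the graph: it only records which streams were registered.
  class ToyGraph {
    val inputStreams = ArrayBuffer.empty[String]
    val outputStreams = ArrayBuffer.empty[String]
  }

  // Toy stream: it knows its graph and, unless it is an input, its parent.
  class ToyStream(val name: String, val parent: Option[ToyStream], graph: ToyGraph) {
    // A transformation returns a new stream that remembers this one as its parent.
    def transform(label: String): ToyStream =
      new ToyStream(s"$label($name)", Some(this), graph)

    // An output action registers the stream on the graph as an output stream.
    def output(label: String): Unit =
      graph.outputStreams += s"$label($name)"
  }

  // Creating an input stream registers it on the graph right away.
  def inputStream(name: String, graph: ToyGraph): ToyStream = {
    graph.inputStreams += name
    new ToyStream(name, None, graph)
  }

  def demo(): (List[String], List[String]) = {
    val graph = new ToyGraph
    val lines = inputStream("socket", graph)
    val filtered = lines.transform("filter")
    val mapped = filtered.transform("map")
    mapped.output("foreachRDD")
    (graph.inputStreams.toList, graph.outputStreams.toList)
  }
}
```

In this model the graph never sees the intermediate streams directly; they are reachable only by walking back from an output stream through its parents.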

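How is data read from the MemoryStore?

Every DStream has the methods generateJob, getOrCompute, and compute. generateJob calls getOrCompute, which calls compute only when the RDD for that batch has not been generated yet. For an input stream, compute loads the received data from memory.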
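
The call chain can be sketched with a toy model in which a plain Seq[Int] stands in for an RDD. ToyDStream, ToyInputDStream, ToyMappedDStream, and the loads counter are hypothetical; the signatures are simplified and do not match Spark's.

```scala
import scala.collection.mutable

object ComputeSketch {
  type Batch = Seq[Int] // stands in for an RDD in this toy model

  abstract class ToyDStream {
    // RDDs already generated for this stream, keyed by batch time.
    private val generatedRDDs = mutable.HashMap.empty[Long, Batch]

    // How one batch is produced; each kind of stream defines this.
    def compute(time: Long): Option[Batch]

    // Return the cached batch, or compute it once and remember it.
    def getOrCompute(time: Long): Option[Batch] =
      generatedRDDs.get(time).orElse {
        val computed = compute(time)
        computed.foreach(batch => generatedRDDs.put(time, batch))
        computed
      }

    // Wrap the batch in a deferred job; nothing runs until the job is called.
    def generateJob(time: Long, action: Batch => Unit): Option[() => Unit] =
      getOrCompute(time).map(batch => () => action(batch))
  }

  // Input stream: compute reads the data received for that batch from "memory".
  class ToyInputDStream(memory: Map[Long, Batch]) extends ToyDStream {
    var loads = 0 // counts how often memory was actually read
    def compute(time: Long): Option[Batch] = {
      loads += 1
      Some(memory.getOrElse(time, Seq.empty))
    }
  }

  // Derived stream: compute asks the parent for its batch and transforms it.
  class ToyMappedDStream(parent: ToyDStream, f: Int => Int) extends ToyDStream {
    def compute(time: Long): Option[Batch] =
      parent.getOrCompute(time).map(_.map(f))
  }

  def demo(): (Seq[Int], Int) = {
    val input = new ToyInputDStream(Map(1000L -> Seq(1, 2, 3)))
    val doubled = new ToyMappedDStream(input, _ * 2)
    var seen: Seq[Int] = Seq.empty
    // Generating a job twice for the same batch time reuses the cached batch.
    doubled.generateJob(1000L, batch => seen = batch).foreach(job => job())
    doubled.generateJob(1000L, batch => seen = batch).foreach(job => job())
    (seen, input.loads)
  }
}
```

In demo, memory is read only once even though two jobs are generated for the same batch time, because the second call is answered from the cache in getOrCompute.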

A simple class/entity relationship:

StreamingContext:

sparkContext: jobs are run by calling its runJob method, which invokes the DAGScheduler
graph: holds the DStreams registered as input streams and output streams
JobScheduler: see below

JobScheduler: eventLoop

JobGenerator: generates jobs from ssc.graph every batchDuration, then submits them to the JobExecutor
JobExecutor: executes the jobs submitted by the JobGenerator; each job runs by calling sparkContext.runJob
receiverTracker:

JobGenerator(jobScheduler): eventLoop

RecurringTimer: fires every batchDuration and triggers the GenerateJobs action
generateJobs: generates a job for each output stream in ssc.graph
Submits the jobs to the jobScheduler
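
The timer, generateJobs, and submit steps above can be sketched as follows. This is a toy model under stated assumptions: ToyJobScheduler, ToyJobGenerator, maxBatches, and sink are hypothetical, a ScheduledExecutorService stands in for RecurringTimer, and batch times are simple multiples of the interval rather than clock times.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object JobGeneratorSketch {
  // A job pairs a batch time with the deferred work for one output stream.
  case class Job(time: Long, run: () => Unit)

  // Toy scheduler: runs submitted jobs on a single-threaded "job executor".
  class ToyJobScheduler {
    private val jobExecutor = Executors.newFixedThreadPool(1)
    def submitJobs(jobs: Seq[Job]): Unit =
      jobs.foreach(job => jobExecutor.execute(() => job.run()))
    def stop(): Unit = {
      jobExecutor.shutdown()
      jobExecutor.awaitTermination(5, TimeUnit.SECONDS)
    }
  }

  // Toy generator: a timer fires once per batch and creates one job per output stream.
  class ToyJobGenerator(
      scheduler: ToyJobScheduler,
      outputStreams: Seq[String],
      batchMillis: Long,
      maxBatches: Int, // illustration only: stop generating after this many batches
      sink: ArrayBuffer[String]) {
    private val timer = Executors.newSingleThreadScheduledExecutor()
    private var batches = 0 // touched only by the timer thread
    val finished = new CountDownLatch(maxBatches)

    private def generateJobs(time: Long): Unit = {
      val jobs = outputStreams.map { out =>
        Job(time, () => sink.synchronized { sink += s"$out@$time" })
      }
      scheduler.submitJobs(jobs)
    }

    def start(): Unit = {
      val tick: Runnable = () => {
        if (batches < maxBatches) {
          batches += 1
          generateJobs(batches * batchMillis)
          finished.countDown()
        }
      }
      timer.scheduleAtFixedRate(tick, batchMillis, batchMillis, TimeUnit.MILLISECONDS)
    }

    def stop(): Unit = timer.shutdownNow()
  }

  def demo(): List[String] = {
    val sink = ArrayBuffer.empty[String]
    val scheduler = new ToyJobScheduler
    val generator = new ToyJobGenerator(scheduler, Seq("print", "save"), 20L, 2, sink)
    generator.start()
    generator.finished.await(5, TimeUnit.SECONDS) // both batches were submitted
    generator.stop()
    scheduler.stop() // waits for the submitted jobs to finish
    sink.synchronized { sink.toList }
  }
}
```

The generator only creates and submits jobs; running them is left to the scheduler's executor, so a slow job does not delay the next timer tick.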

DStream(ssc):

foreachRDD: registers the stream as an output stream
register: adds the stream to the graph's output streams, which generateJobs() uses
generateJob: calls getOrCompute to get the RDD and creates a Job object
getOrCompute/compute: gets the RDD for a batch, either from the parent stream or from memory if it’s an InputDStream
generatedRDDs:

SparkContext:

DAGScheduler

DAGScheduler: eventLoop

Creates the stages of an RDD based on its RDD dependencies
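
As a rough illustration of how dependencies decide stage boundaries, the toy model below walks back from the last node: narrow dependencies stay in the current stage and each shuffle dependency starts a parent stage. Node, Narrow, Shuffle, Stage, and buildStage are hypothetical names, and the sketch ignores details such as ancestors shared by several branches.

```scala
object StageSketch {
  sealed trait Dep { def parent: Node }
  case class Narrow(parent: Node) extends Dep  // e.g. map, filter
  case class Shuffle(parent: Node) extends Dep // e.g. reduceByKey

  // Toy stand-in for an RDD: a name plus its dependencies.
  case class Node(name: String, deps: List[Dep] = Nil)

  // A stage groups nodes connected through narrow dependencies only.
  case class Stage(nodes: List[String], parents: List[Stage])

  def buildStage(last: Node): Stage = {
    // Returns the nodes of the current stage and the parent stages found so far.
    def walk(node: Node): (List[String], List[Stage]) = {
      val results = node.deps.map {
        case Narrow(p)  => walk(p)                        // same stage
        case Shuffle(p) => (Nil, List(buildStage(p)))     // new parent stage
      }
      (results.flatMap(_._1) :+ node.name, results.flatMap(_._2))
    }
    val (nodes, parents) = walk(last)
    Stage(nodes, parents)
  }

  def demo(): Stage = {
    val text = Node("textFile")
    val mapped = Node("map", List(Narrow(text)))
    val reduced = Node("reduceByKey", List(Shuffle(mapped)))
    val filtered = Node("filter", List(Narrow(reduced)))
    buildStage(filtered)
  }
}
```

For the lineage in demo, textFile and map end up in a parent stage, while reduceByKey and filter form the final stage that depends on it.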

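import org.apache.spark.streaming.dstream.DStream
// assumes an existing StreamingContext named ssc; host and port are placeholders
val source: DStream[String] = ssc.socketTextStream("localhost", 9999) // input stream
val transformed: DStream[String] = source.transform(rdd => rdd.filter(_.nonEmpty)) // transformed stores source as its dependency
transformed.foreachRDD(rdd => rdd.collect().foreach(println)) // registers an output stream; foreachRDD returns Unit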