When a streaming context starts, it starts a JobScheduler, which does the following:
- Start ReceiverTracker: receivers read data from the source and use BlockManager to save it to the memoryStore
- Start JobGenerator: at every batch interval, create jobs for that batch time from the graph and its output streams
- Start other resources, such as the job listener
Each scheduler runs in an eventLoop.
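To make the eventLoop idea concrete, here is a minimal sketch: a queue drained by a single daemon thread, with handlers defined by the subclass. It is not Spark's own class; SimpleEventLoop and the two event types are hypothetical names used only for illustration.

```scala
import java.util.concurrent.{CountDownLatch, LinkedBlockingQueue, TimeUnit}
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for an event loop: events are queued by any thread
// and handled one at a time by a single daemon thread.
abstract class SimpleEventLoop[E](name: String) {
  private val queue = new LinkedBlockingQueue[E]()
  @volatile private var running = true

  private val thread = new Thread(name) {
    setDaemon(true)
    override def run(): Unit = {
      try {
        while (running) onReceive(queue.take())
      } catch {
        case _: InterruptedException => // stop() interrupts a blocked take(); exit quietly
      }
    }
  }

  def start(): Unit = thread.start()
  def stop(): Unit = { running = false; thread.interrupt() }
  def post(event: E): Unit = queue.put(event)

  // Called on the event thread for every posted event.
  protected def onReceive(event: E): Unit
}

// Hypothetical events a scheduler might handle.
sealed trait SchedulerEvent
final case class JobStarted(id: Int) extends SchedulerEvent
final case class JobCompleted(id: Int) extends SchedulerEvent

object EventLoopDemo {
  def run(): Seq[String] = {
    val log = ArrayBuffer.empty[String]
    val done = new CountDownLatch(2)
    val loop = new SimpleEventLoop[SchedulerEvent]("job-scheduler-sketch") {
      override protected def onReceive(event: SchedulerEvent): Unit = {
        event match {
          case JobStarted(id)   => log += s"started $id"
          case JobCompleted(id) => log += s"completed $id"
        }
        done.countDown()
      }
    }
    loop.start()
    loop.post(JobStarted(1))
    loop.post(JobCompleted(1))
    done.await(5, TimeUnit.SECONDS) // wait until both events were handled
    loop.stop()
    log.toList
  }
}
```

Because a single thread handles all events, the handlers never run concurrently with each other, which is the main reason to put a scheduler behind a loop like this.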
A streaming application usually contains:
- An input stream (InputDStream): examples are SocketInputDStream, KafkaInputDStream, FileInputDStream, etc.
- Transformations: functions that generate another stream from an existing one, such as filter and map
- An output stream: the output actions of the streaming application; these trigger the stream to be materialized
The StreamingContext has a graph property, on which the input streams and output streams are registered.
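The three parts of an application (input, transformations, output) might look like this in a small local program. This is a sketch and not taken from the notes: it assumes Spark Streaming is on the classpath, uses queueStream as the input so no external source is needed, and the object name, master setting, and timeout are arbitrary choices.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object StreamingAppSketch {
  def run(): Seq[Int] = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingAppSketch")
    val ssc = new StreamingContext(conf, Seconds(1)) // batchDuration = 1 second

    // Input stream: queueStream turns each queued RDD into one batch.
    val queue = mutable.Queue[RDD[Int]](ssc.sparkContext.makeRDD(1 to 10))
    val source = ssc.queueStream(queue)

    // Transformations: each call returns a new DStream; nothing runs yet.
    val evenSquares = source.filter(_ % 2 == 0).map(n => n * n)

    // Output operation: registers an output stream, so jobs are generated per batch.
    // The function passed to foreachRDD runs on the driver.
    val results = mutable.ArrayBuffer.empty[Int]
    evenSquares.foreachRDD { rdd =>
      rdd.collect().foreach(n => results.synchronized { results += n })
    }

    ssc.start()                         // starts the scheduler and job generation
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run
    ssc.stop(stopSparkContext = true, stopGracefully = false)
    results.synchronized { results.toList }
  }
}
```

Without the foreachRDD call there would be no output stream, and ssc.start() would fail because no output operations are registered.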
How is data read from the memoryStore?
- Every DStream has the methods generateJob, getOrCompute, and compute, called in that order; for an input stream, compute loads the received data from memory
A simple class/entity relationship:
StreamingContext:
- sparkContext: jobs are run by calling its runJob method, which invokes the DAGScheduler
- graph: a list of DStreams, including the output streams and input streams
- JobScheduler: See below
JobScheduler: eventLoop
- JobGenerator: generates jobs based on ssc.graph at every batchDuration interval, then submits the jobs to JobExecutor
- JobExecutor: executes the jobs submitted by JobGenerator; a job is executed by calling sparkContext's runJob func
- receiverTracker:
JobGenerator(jobScheduler): eventLoop
- RecurringTimer: its interval is batchDuration; it triggers the GenerateJobs action
- generateJobs: generates a job for each outputStream in ssc.graph
- Submits the jobs to jobScheduler
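The timer, generateJobs, and submit steps above can be sketched with plain Scala stand-ins. Everything here is hypothetical (ToyOutputStream, ToyJobScheduler, ToyJobGenerator are not Spark classes), and the timer is simulated in logical time instead of sleeping, so the run is deterministic.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

final case class Job(time: Long, name: String, body: () => Unit)

// Stand-in for an output stream registered on the graph: it can build a job for a batch time.
final class ToyOutputStream(val name: String, sink: String => Unit) {
  def generateJob(time: Long): Job = Job(time, name, () => sink(s"$name@$time"))
}

// Stand-in for the scheduler side: submitted jobs run on a thread pool (the job executor).
final class ToyJobScheduler(threads: Int) {
  private val jobExecutor = Executors.newFixedThreadPool(threads)

  def submitJobSet(time: Long, jobs: Seq[Job]): Unit =
    jobs.foreach { job =>
      jobExecutor.execute(new Runnable { override def run(): Unit = job.body() })
    }

  // Stop accepting jobs and wait for the submitted ones to finish.
  def stop(): Unit = {
    jobExecutor.shutdown()
    jobExecutor.awaitTermination(5, TimeUnit.SECONDS)
  }
}

final class ToyJobGenerator(
    batchDurationMs: Long,
    outputStreams: Seq[ToyOutputStream],
    scheduler: ToyJobScheduler) {

  // Handles one "generate jobs" event: one job per output stream, submitted as a set.
  def generateJobs(time: Long): Unit =
    scheduler.submitJobSet(time, outputStreams.map(_.generateJob(time)))

  // Simulated recurring timer: fires once per batch interval, in logical time.
  def runTimer(startMs: Long, batches: Int): Unit =
    (1 to batches).foreach(i => generateJobs(startMs + i * batchDurationMs))
}

object JobGeneratorDemo {
  def run(): Seq[String] = {
    val seen = ArrayBuffer.empty[String]
    val sink: String => Unit = s => seen.synchronized { seen += s }
    val scheduler = new ToyJobScheduler(threads = 1) // one job at a time
    val outputs = Seq(new ToyOutputStream("print", sink), new ToyOutputStream("save", sink))
    val generator = new ToyJobGenerator(1000L, outputs, scheduler)
    generator.runTimer(startMs = 0L, batches = 2)
    scheduler.stop()
    seen.synchronized { seen.toList }
  }
}
```

With a single executor thread the jobs run in submission order, batch by batch; a larger pool would let jobs from the same or different batches overlap.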
DStream(ssc):
- foreachRDD: registers the stream as an outputStream
- register: adds the stream to the graph's output streams, which generateJobs() iterates over
- generateJob: calls getOrCompute (which calls compute) to get the RDD for the batch and creates a Job object
- getOrCompute/compute: get the RDD, either from the parent stream or, for an InputDStream, from memory
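- generatedRDDs: a map from batch time to the RDD already generated for that batch; getOrCompute checks it before calling compute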
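The call chain generateJob, getOrCompute, compute can be illustrated with a toy model. This is a sketch under loud assumptions: a "batch" is a plain Seq instead of an RDD, batch times are plain Longs, and all Toy* classes are hypothetical simplifications, not Spark's DStream classes.

```scala
import scala.collection.mutable

// A job is just a batch time plus a thunk to run.
final case class Job(time: Long, run: () => Unit)

abstract class ToyDStream[T] {
  // Cache of batches already produced, keyed by batch time.
  private val generatedRDDs = mutable.HashMap.empty[Long, Seq[T]]

  // Subclasses say how to build the batch for a given time.
  protected def compute(time: Long): Option[Seq[T]]

  // Return the cached batch, or compute it once and cache it.
  final def getOrCompute(time: Long): Option[Seq[T]] =
    generatedRDDs.get(time).orElse {
      val batch = compute(time)
      batch.foreach(b => generatedRDDs.put(time, b))
      batch
    }

  def map[U](f: T => U): ToyDStream[U] = new ToyMappedDStream(this, f)
}

// Input stream: loads "received data" from an in-memory store.
class ToyInputDStream[T](store: Map[Long, Seq[T]]) extends ToyDStream[T] {
  var computeCalls = 0
  protected def compute(time: Long): Option[Seq[T]] = {
    computeCalls += 1
    store.get(time)
  }
}

// Transformed stream: asks its parent for the batch, then applies the function.
class ToyMappedDStream[T, U](parent: ToyDStream[T], f: T => U) extends ToyDStream[U] {
  protected def compute(time: Long): Option[Seq[U]] =
    parent.getOrCompute(time).map(_.map(f))
}

// Output stream: generateJob wraps the parent's batch in a Job that runs the action.
class ToyForEachDStream[T](parent: ToyDStream[T], action: Seq[T] => Unit)
    extends ToyDStream[Unit] {
  protected def compute(time: Long): Option[Seq[Unit]] = None
  def generateJob(time: Long): Option[Job] =
    parent.getOrCompute(time).map(batch => Job(time, () => action(batch)))
}

object DStreamDemo {
  def run(): (Seq[Int], Int) = {
    val input = new ToyInputDStream(Map(1000L -> Seq(1, 2, 3, 4)))
    val doubled = input.map(_ * 2)
    val out = mutable.ArrayBuffer.empty[Int]
    val output = new ToyForEachDStream[Int](doubled, batch => out ++= batch)

    // Generate and run the job for the same batch time twice.
    output.generateJob(1000L).foreach(_.run())
    output.generateJob(1000L).foreach(_.run())
    (out.toList, input.computeCalls)
  }
}
```

The second generateJob call finds the batch in the cache of the mapped stream, so the input stream's compute runs only once even though the output action runs twice.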
SparkContext
- DAGScheduler
DAGScheduler: eventLoop
- Creates the stages for an RDD based on the RDD's dependencies
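A rough sketch of how stages can be derived from dependencies: walk back from the final RDD, keep narrow dependencies in the same stage, and start a new parent stage at every shuffle dependency. The types below (Node, Narrow, Shuffle, Stage) are hypothetical, and the real scheduler does more, for example reusing a stage when the same shuffle is reached twice.

```scala
// Toy lineage: a node is an RDD name plus its dependencies on parent nodes.
sealed trait Dep { def parent: Node }
final case class Narrow(parent: Node) extends Dep
final case class Shuffle(parent: Node) extends Dep
final case class Node(name: String, deps: List[Dep] = Nil)

// A stage is the pipelined RDDs (in execution order) plus the stages it waits for.
final case class Stage(rdds: List[String], parents: List[Stage])

object StageSketch {
  def createStage(last: Node): Stage = {
    // Returns the RDD names pipelined with `node` and the parent stages found on the way.
    def visit(node: Node): (List[String], List[Stage]) = {
      val results: List[(List[String], List[Stage])] = node.deps.map {
        case Narrow(p)  => visit(p)                                    // same stage
        case Shuffle(p) => (List.empty[String], List(createStage(p))) // stage boundary
      }
      (results.flatMap(_._1) :+ node.name, results.flatMap(_._2))
    }
    val (rdds, parents) = visit(last)
    Stage(rdds, parents)
  }

  // textFile -> map -> (shuffle) -> reduceByKey -> filter
  def demo(): Stage = {
    val text = Node("textFile")
    val pairs = Node("map", List(Narrow(text)))
    val counts = Node("reduceByKey", List(Shuffle(pairs)))
    val result = Node("filter", List(Narrow(counts)))
    createStage(result)
  }
}
```

In the demo lineage the single shuffle splits the work into two stages: textFile and map run first, then reduceByKey and filter run in the final stage.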
val source: DStream