Spark structured streaming is implemented in spark sql module.
Spark Session:
sparkSession is the entry to Spark sql and structured streaming.
- Spark Context: spark context is the entry point of spark, so it’s naturally the bedrock of spark session as well.
- readStream: Read streaming data in as a DataFrame, this method returns an DataStreamReader object
- StreamingQueryManager: Managing the execution of all streaming quires
- createDataFrame: Generate Data frames from various sources,(DataFrame=Dataset[Row])
- createDataset:
- sql: execute sql quires and return data as DataFrame
DataStreamReader and DataStreamWriter
- DataStreamReader: Load streaming data from external sources
- DataStreamWriter: Output streaming data to external sources
DataStreamWriter has a start() method which callsdf.sparkSession.sessionState.StreamingQueryManager.startQuery()
to start streaming query.
As you can see, the query is started through StreamingQueryManager.
1 | streamSource = spark \ |
Dataset API
Dataset is a strong typed data structure used to do transformations and actions in structured streaming.
- logicalPlan: After transformations and actions defined on a Dataset, it will be analyzed to logical plan, then optimized to physical plan for final execution.
Some features can be applied are:
- withWatermark: A watermark defines a point in time before which we assume no more late data is going to arrive. This is useful for late data in streaming situation.
- checkpoint: Apply dataset checkpointing, either eagerly or non eagerly.
cache: Cache dataset to memory.
A lot transformations can be applied:
- groupBy:
- groupByKey: This will return a KeyValueGroupedDataset
- agg:
- repartition/coalesce: Returns a new dataset by specified partitioning.
StreamingQueryManager
Manages all the StreamingQuerys active in a SparkSession
.
startQuery
method: create query with DataFrame and call StreamingExecution start()
StreamingExecution
StreamingExecution is a implementation of trait StreamingQuery, it can be:
- ContinousExecution: This mode doesn’t support
complete
outputMode and aggregation either. - MicrobatchExecution:
Some methods in StreamingExecution:
- start(): start a thread of queryEexcutionThread
- queryExecutionThread: This thread will run method runStream()
- runStream(): The method to materialize the streaming query
- runActivatedStream: All implementations need to implement this method. working on the logical plan