Spark Structured Streaming is implemented in the Spark SQL module.
Spark Session:
SparkSession is the entry point to Spark SQL and Structured Streaming.
- SparkContext: SparkContext is the entry point to Spark itself, so it is naturally the bedrock of SparkSession as well.
- readStream: Returns a DataStreamReader object, which is used to read streaming data in as a DataFrame.
- StreamingQueryManager: Manages the execution of all streaming queries. SparkSession exposes it through streams.
- createDataFrame: Creates DataFrames from various sources such as RDDs and local collections (DataFrame = Dataset[Row]).
- createDataset: Creates a typed Dataset from a local collection or an RDD (Scala/Java API).
- sql: Executes a SQL query and returns the result as a DataFrame.
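As a rough illustration of the members listed above, here is a minimal PySpark sketch. It assumes `spark` is an existing `pyspark.sql.SparkSession`. The function name `tour_session_entry_points` and the view name `demo_rows` are made up for this example. `createDataset` is left out because it only exists in the Scala/Java API.

```python
def tour_session_entry_points(spark):
    """Touch each SparkSession member from the list above.

    `spark` is assumed to be a pyspark.sql.SparkSession, for example one built
    with SparkSession.builder.master("local[2]").getOrCreate().
    """
    context = spark.sparkContext   # the underlying SparkContext
    reader = spark.readStream      # a DataStreamReader (a property in PySpark)
    manager = spark.streams        # the StreamingQueryManager of this session

    # createDataFrame: build a DataFrame from local rows plus column names.
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

    # Register the DataFrame under a hypothetical view name so sql can see it.
    df.createOrReplaceTempView("demo_rows")

    # sql: run a query and get the result back as a DataFrame.
    result = spark.sql("SELECT id FROM demo_rows WHERE label = 'a'")

    return {
        "context": context,
        "reader": reader,
        "manager": manager,
        "df": df,
        "result": result,
    }
```

With a real session, `tour_session_entry_points(spark)["result"].show()` would print the filtered rows.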
DataStreamReader and DataStreamWriter
- DataStreamReader: Load streaming data from external sources
- DataStreamWriter: Writes streaming data out to external sinks
 DataStreamWriter has a start() method, which calls df.sparkSession.sessionState.streamingQueryManager.startQuery() to start the streaming query.
 As you can see, the query is started through StreamingQueryManager.
streamSource = spark \
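To make the reader/writer pairing concrete, here is a minimal PySpark sketch of a complete source-to-sink pipeline. It is not the author's original snippet. It assumes `spark` is an existing SparkSession, and it picks the built-in `rate` source and `console` sink purely for illustration. The function name `start_rate_to_console` is hypothetical.

```python
def start_rate_to_console(spark, rows_per_second=5):
    """Read from the built-in rate source and print each batch to the console.

    `spark` is assumed to be a pyspark.sql.SparkSession.
    """
    # DataStreamReader: describe where the streaming data comes from.
    stream_source = (
        spark.readStream
        .format("rate")                           # test source that generates rows
        .option("rowsPerSecond", rows_per_second)
        .load()                                   # returns a streaming DataFrame
    )

    # DataStreamWriter: describe where the results go, then start the query.
    query = (
        stream_source.writeStream
        .format("console")
        .outputMode("append")
        .start()                                  # returns a StreamingQuery
    )
    return query
```

With a real session, the returned query can be watched with `query.awaitTermination(10)` and shut down with `query.stop()`. While it is running, it also appears in `spark.streams.active`.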
Dataset API
Dataset is a strongly typed data structure on which transformations and actions are defined in Structured Streaming.
- logicalPlan: The transformations and actions defined on a Dataset form a logical plan, which is analyzed, optimized, and then converted to a physical plan for final execution.
Some features that can be applied are:
- withWatermark: A watermark defines a point in time before which we assume no more late data is going to arrive. This is useful for handling late data in streaming workloads.
- checkpoint: Checkpoints the Dataset, either eagerly or lazily.
- cache: Persists the Dataset with the default storage level (MEMORY_AND_DISK).
Many transformations can be applied as well:
- groupBy: Groups the Dataset by the given columns and returns a RelationalGroupedDataset.
- groupByKey: Returns a KeyValueGroupedDataset.
- agg: Aggregates over the entire Dataset without groups.
- repartition/coalesce: Returns a new Dataset with the specified partitioning.
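The watermark idea from the list above can be sketched in plain Python. This is a simplified toy model, not Spark code. It assumes the watermark is the largest event time seen so far minus an allowed delay, that it only advances after a batch has been processed, and that anything older is simply dropped. In Spark the watermark is applied by stateful operators rather than as a blanket filter. The class name `ToyWatermark` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToyWatermark:
    delay: int                # allowed lateness, in seconds
    max_event_time: int = 0   # largest event time seen so far
    watermark: int = 0        # events older than this count as too late

    def process_batch(self, event_times):
        """Split one micro-batch of event times into (accepted, dropped)."""
        accepted = [t for t in event_times if t >= self.watermark]
        dropped = [t for t in event_times if t < self.watermark]

        # Advance the watermark only after the batch, and never move it back.
        if event_times:
            self.max_event_time = max(self.max_event_time, max(event_times))
        self.watermark = max(self.watermark, self.max_event_time - self.delay)
        return accepted, dropped


wm = ToyWatermark(delay=10)
print(wm.process_batch([100, 105, 112]))  # ([100, 105, 112], [])
print(wm.watermark)                       # 102
print(wm.process_batch([101, 103, 120]))  # ([103, 120], [101])
print(wm.watermark)                       # 110
```

In the second batch the event at time 101 falls behind the watermark of 102, so this model treats it as too late.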
 
StreamingQueryManager
Manages all the StreamingQuery instances active in a SparkSession.
- startQuery method: Creates a query from a DataFrame and calls start() on the underlying StreamExecution.
StreamExecution
StreamExecution is an abstract implementation of the StreamingQuery trait. It can be:
- ContinuousExecution: This mode supports neither the complete output mode nor aggregations.
- MicroBatchExecution: The default mode, which processes the stream as a series of small batch jobs.
Some methods in StreamExecution:
- start(): Starts the queryExecutionThread.
- queryExecutionThread: This thread runs the runStream() method.
- runStream(): The method that materializes the streaming query.
- runActivatedStream(): Every implementation must implement this method. It does the actual work on the logical plan.
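The call chain described above (the manager starts a query, start() launches a dedicated thread, the thread runs runStream(), which delegates to runActivatedStream()) can be modeled in a few lines of plain Python. This is a toy sketch of the control flow only, not Spark's implementation. All class and method names here (`ToyExecution`, `ToyMicroBatch`, `ToyQueryManager`, and so on) are hypothetical stand-ins.

```python
import threading
from abc import ABC, abstractmethod


class ToyExecution(ABC):
    """Hypothetical stand-in for the abstract execution class."""

    def __init__(self, name):
        self.name = name
        self.exception = None
        # Counterpart of queryExecutionThread: its body is run_stream().
        self.query_execution_thread = threading.Thread(
            target=self.run_stream, name=f"stream-{name}", daemon=True
        )

    def start(self):
        # start() only launches the thread; the work happens in run_stream().
        self.query_execution_thread.start()

    def run_stream(self):
        # Shared bookkeeping around the mode-specific loop.
        try:
            self.run_activated_stream()
        except Exception as exc:  # remember the failure for the caller
            self.exception = exc

    @abstractmethod
    def run_activated_stream(self):
        """Each execution mode supplies its own processing loop."""

    def await_termination(self, timeout=None):
        self.query_execution_thread.join(timeout)


class ToyMicroBatch(ToyExecution):
    """Processes a finite list of batches, one at a time."""

    def __init__(self, name, batches, sink):
        super().__init__(name)
        self.batches = batches
        self.sink = sink

    def run_activated_stream(self):
        for batch in self.batches:
            self.sink.append(sum(batch))  # stand-in for running one batch job


class ToyQueryManager:
    """Creates queries, tracks them by name, and starts them."""

    def __init__(self):
        self.active = {}

    def start_query(self, name, batches, sink):
        query = ToyMicroBatch(name, batches, sink)
        self.active[name] = query
        query.start()
        return query


manager = ToyQueryManager()
results = []
query = manager.start_query("demo", [[1, 2], [3, 4, 5]], results)
query.await_termination(timeout=5)
print(results)  # [3, 12]
```

Keeping the thread handling in the base class and leaving only `run_activated_stream` abstract reflects the split described above: the shared start-up logic lives in one place, and each execution mode only provides its own loop.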