Spark Structured Streaming Internals

Spark Structured Streaming is implemented in the Spark SQL module.

SparkSession:

SparkSession is the entry point to Spark SQL and Structured Streaming.

  • SparkContext: the SparkContext is the entry point to Spark core, so it is naturally the bedrock of SparkSession as well.
  • readStream: reads streaming data in as a DataFrame; this method returns a DataStreamReader object.
  • StreamingQueryManager: manages the execution of all streaming queries.
  • createDataFrame: creates DataFrames from various sources (a DataFrame is a Dataset[Row]).
  • createDataset: creates a strongly typed Dataset from a local collection or an RDD.
  • sql: executes a SQL query and returns the result as a DataFrame.

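A minimal PySpark sketch tying these entry points together. The "rate" source is Spark's built-in test source, and the app name and sample data are illustrative assumptions:

from pyspark.sql import SparkSession

# SparkSession wraps the SparkContext and exposes both batch and streaming entry points.
spark = SparkSession.builder \
    .appName("entry-point-sketch") \
    .master("local[*]") \
    .getOrCreate()

# readStream returns a DataStreamReader for building a streaming DataFrame.
stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", "5") \
    .load()

# createDataFrame builds a DataFrame (Dataset[Row]) from a local collection.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# sql executes a SQL query and returns the result as a DataFrame.
df.createOrReplaceTempView("t")
spark.sql("SELECT id FROM t").show()

# The session's StreamingQueryManager is reachable as spark.streams.
print(len(spark.streams.active))
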
DataStreamReader and DataStreamWriter

  • DataStreamReader: loads streaming data from external sources.
  • DataStreamWriter: writes streaming data out to external sinks.
    DataStreamWriter has a start() method, which calls df.sparkSession.sessionState.streamingQueryManager.startQuery() to start the streaming query.
    As you can see, the query is started through the StreamingQueryManager.

streamSource = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "earliest") \
    .option("kafka.group.id", CONSUMER_ID) \
    .load()  # spark.readStream returns a DataStreamReader; load() builds the streaming DataFrame

output = streamSource \
    .writeStream \
    .outputMode("complete") \
    .trigger(processingTime=trigger_interval) \
    .foreach(sink) \
    .start()  # writeStream returns a DataStreamWriter; start() launches the query

Dataset API

Dataset is a strongly typed data structure used for transformations and actions in Structured Streaming.

  • logicalPlan: the transformations and actions defined on a Dataset are captured as a logical plan, which is analyzed, optimized, and compiled to a physical plan for final execution.

Some features that can be applied are:

  • withWatermark: a watermark defines a point in time before which we assume no more late data is going to arrive. This is useful for handling late data in streaming scenarios (see the windowed-aggregation sketch after the transformations list below).
  • checkpoint: applies Dataset checkpointing, either eagerly or lazily (a sketch follows this list).
  • cache: caches the Dataset in memory.

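A small sketch of cache and eager versus lazy checkpointing; the checkpoint directory is a hypothetical local path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint-sketch")  # hypothetical path

ds = spark.range(0, 1000)

# cache keeps the Dataset in memory for reuse across actions.
cached = ds.cache()

# eager=True materializes the checkpoint immediately and truncates lineage;
# eager=False defers the work to the next action.
checkpointed = cached.checkpoint(eager=True)

print(checkpointed.count())
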
Many transformations can be applied:

  • groupBy: groups rows by the given columns, returning a RelationalGroupedDataset.
  • groupByKey: groups by a key function and returns a KeyValueGroupedDataset.
  • agg: computes aggregations, typically over grouped data.
  • repartition/coalesce: return a new Dataset with the specified partitioning.

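The sketch below combines withWatermark with a windowed groupBy/agg, as promised above; the rate source and the window/watermark durations are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The rate source emits (timestamp, value) rows, handy for local experiments.
events = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", "10") \
    .load()

counts = (events
    # drop state for events more than 10 minutes behind the max event time seen
    .withWatermark("timestamp", "10 minutes")
    # tumbling 5-minute windows keyed on event time
    .groupBy(window("timestamp", "5 minutes"))
    .agg(count("*").alias("events")))

query = counts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
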
StreamingQueryManager

Manages all the StreamingQuery instances active in a SparkSession.

  • startQuery method: creates a query from the DataFrame and calls StreamingExecution's start(); a usage sketch follows.

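A short user-side sketch of the StreamingQueryManager; spark.streams is the manager owned by the session, and the rate/console formats are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

query = spark.readStream \
    .format("rate") \
    .load() \
    .writeStream \
    .format("console") \
    .start()  # routed internally through streamingQueryManager.startQuery()

# Enumerate the active queries managed by this session.
for q in spark.streams.active:
    print(q.id, q.status)

# Block until any managed query terminates.
spark.streams.awaitAnyTermination()
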
StreamingExecution

StreamingExecution is an implementation of the trait StreamingQuery; it can be one of:

  • ContinuousExecution: this mode supports neither the complete output mode nor aggregations.
  • MicroBatchExecution: the default mode, which processes the stream as a series of small batches driven by the trigger (see the trigger sketch below).

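Which implementation runs is decided by the trigger set on the DataStreamWriter; a sketch with both triggers, using an assumed checkpoint path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
source = spark.readStream.format("rate").load()

# A processingTime trigger selects MicroBatchExecution.
micro = (source.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start())
micro.stop()

# A continuous trigger selects ContinuousExecution; remember it allows
# neither aggregations nor the complete output mode.
cont = (source.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/continuous-sketch")  # assumed path
    .trigger(continuous="1 second")
    .start())
cont.awaitTermination()
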
Some methods in StreamingExecution:

  • start(): starts the queryExecutionThread.
  • queryExecutionThread: this thread runs the method runStream().
  • runStream(): the method that materializes the streaming query.
  • runActivatedStream(): every implementation must provide this method; it does the actual work on the logical plan (a simplified sketch follows).
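
The real classes live in Spark's Scala package org.apache.spark.sql.execution.streaming; the hypothetical Python sketch below only mirrors that skeleton to show how start(), queryExecutionThread, runStream(), and runActivatedStream() relate:

import threading
from abc import ABC, abstractmethod

class StreamingExecutionSketch(ABC):
    """Hypothetical skeleton mirroring Spark's StreamingExecution; not real Spark code."""

    def __init__(self):
        # start() launches this thread, which drives runStream().
        self.query_execution_thread = threading.Thread(
            target=self.run_stream, name="stream execution thread")

    def start(self):
        self.query_execution_thread.start()

    def run_stream(self):
        # ... set up sources, sink, offset log, and state (elided) ...
        self.run_activated_stream()

    @abstractmethod
    def run_activated_stream(self):
        """MicroBatchExecution and ContinuousExecution each implement this:
        repeatedly turn the logical plan plus newly available offsets into
        executable batches against the sink."""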