Spark Structured Streaming Internals

Spark Structured Streaming is implemented in the Spark SQL module.

SparkSession:

SparkSession is the entry point to Spark SQL and Structured Streaming.

  • SparkContext: the SparkContext is the entry point to Spark core, so it is naturally the bedrock of SparkSession as well.
  • readStream: reads streaming data in as a DataFrame; this method returns a DataStreamReader object.
  • StreamingQueryManager: manages the execution of all streaming queries.
  • createDataFrame: creates DataFrames from various sources (a DataFrame is a Dataset[Row]).
  • createDataset: creates a strongly typed Dataset from a local collection or an RDD.
  • sql: executes a SQL query and returns the result as a DataFrame.

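A minimal PySpark sketch tying these entry points together. The "rate" source is Spark's built-in test source, and the app name and sample data are illustrative assumptions:

from pyspark.sql import SparkSession

# SparkSession wraps the SparkContext and exposes both batch and streaming entry points.
spark = SparkSession.builder \
    .appName("entry-point-sketch") \
    .master("local[*]") \
    .getOrCreate()

# readStream returns a DataStreamReader for building a streaming DataFrame.
stream = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", "5") \
    .load()

# createDataFrame builds a DataFrame (Dataset[Row]) from a local collection.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])

# sql executes a SQL query and returns the result as a DataFrame.
df.createOrReplaceTempView("t")
spark.sql("SELECT id FROM t").show()

# The session's StreamingQueryManager is reachable as spark.streams.
print(len(spark.streams.active))
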
DataStreamReader and DataStreamWriter

  • DataStreamReader: loads streaming data from external sources.
  • DataStreamWriter: writes streaming data out to external sinks.
    DataStreamWriter has a start() method, which calls df.sparkSession.sessionState.streamingQueryManager.startQuery() to start the streaming query.
    As you can see, the query is started through the StreamingQueryManager.

streamSource = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", KAFKA_BROKER) \
    .option("subscribe", KAFKA_TOPIC) \
    .option("startingOffsets", "earliest") \
    .option("kafka.group.id", CONSUMER_ID) \
    .load()  # spark.readStream returns a DataStreamReader; load() builds the streaming DataFrame

output = streamSource \
    .writeStream \
    .outputMode("complete") \
    .trigger(processingTime=trigger_interval) \
    .foreach(sink) \
    .start()  # writeStream returns a DataStreamWriter; start() launches the query

Dataset API

Dataset is a strongly typed data structure used for transformations and actions in Structured Streaming.

  • logicalPlan: the transformations and actions defined on a Dataset are captured as a logical plan, which is analyzed, optimized, and compiled to a physical plan for final execution.

Some features that can be applied are:

  • withWatermark: a watermark defines a point in time before which we assume no more late data is going to arrive. This is useful for handling late data in streaming scenarios (see the windowed-aggregation sketch after the transformations list below).
  • checkpoint: applies Dataset checkpointing, either eagerly or lazily (a sketch follows this list).
  • cache: caches the Dataset in memory.

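A small sketch of cache and eager versus lazy checkpointing; the checkpoint directory is a hypothetical local path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoint-sketch")  # hypothetical path

ds = spark.range(0, 1000)

# cache keeps the Dataset in memory for reuse across actions.
cached = ds.cache()

# eager=True materializes the checkpoint immediately and truncates lineage;
# eager=False defers the work to the next action.
checkpointed = cached.checkpoint(eager=True)

print(checkpointed.count())
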
Many transformations can be applied:

  • groupBy: groups rows by the given columns, returning a RelationalGroupedDataset.
  • groupByKey: groups by a key function and returns a KeyValueGroupedDataset.
  • agg: computes aggregations, typically over grouped data.
  • repartition/coalesce: return a new Dataset with the specified partitioning.

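The sketch below combines withWatermark with a windowed groupBy/agg, as promised above; the rate source and the window/watermark durations are illustrative assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.master("local[*]").getOrCreate()

# The rate source emits (timestamp, value) rows, handy for local experiments.
events = spark.readStream \
    .format("rate") \
    .option("rowsPerSecond", "10") \
    .load()

counts = (events
    # drop state for events more than 10 minutes behind the max event time seen
    .withWatermark("timestamp", "10 minutes")
    # tumbling 5-minute windows keyed on event time
    .groupBy(window("timestamp", "5 minutes"))
    .agg(count("*").alias("events")))

query = counts.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()

query.awaitTermination()
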
StreamingQueryManager

Manages all the StreamingQuery instances active in a SparkSession.

  • startQuery method: creates a query from the DataFrame and calls StreamingExecution's start(); a usage sketch follows.

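A short user-side sketch of the StreamingQueryManager; spark.streams is the manager owned by the session, and the rate/console formats are illustrative:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

query = spark.readStream \
    .format("rate") \
    .load() \
    .writeStream \
    .format("console") \
    .start()  # routed internally through streamingQueryManager.startQuery()

# Enumerate the active queries managed by this session.
for q in spark.streams.active:
    print(q.id, q.status)

# Block until any managed query terminates.
spark.streams.awaitAnyTermination()
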
StreamingExecution

StreamingExecution is an implementation of the trait StreamingQuery; it can be one of:

  • ContinuousExecution: this mode supports neither the complete output mode nor aggregations.
  • MicroBatchExecution: the default mode, which processes the stream as a series of small batches driven by the trigger (see the trigger sketch below).

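Which implementation runs is decided by the trigger set on the DataStreamWriter; a sketch with both triggers, using an assumed checkpoint path:

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
source = spark.readStream.format("rate").load()

# A processingTime trigger selects MicroBatchExecution.
micro = (source.writeStream
    .format("console")
    .trigger(processingTime="10 seconds")
    .start())
micro.stop()

# A continuous trigger selects ContinuousExecution; remember it allows
# neither aggregations nor the complete output mode.
cont = (source.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/continuous-sketch")  # assumed path
    .trigger(continuous="1 second")
    .start())
cont.awaitTermination()
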
Some methods in StreamingExecution:

  • start(): starts the queryExecutionThread.
  • queryExecutionThread: this thread runs the method runStream().
  • runStream(): the method that materializes the streaming query.
  • runActivatedStream(): every implementation must provide this method; it does the actual work on the logical plan (a simplified sketch follows).
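
The real classes live in Spark's Scala package org.apache.spark.sql.execution.streaming; the hypothetical Python sketch below only mirrors that skeleton to show how start(), queryExecutionThread, runStream(), and runActivatedStream() relate:

import threading
from abc import ABC, abstractmethod

class StreamingExecutionSketch(ABC):
    """Hypothetical skeleton mirroring Spark's StreamingExecution; not real Spark code."""

    def __init__(self):
        # start() launches this thread, which drives runStream().
        self.query_execution_thread = threading.Thread(
            target=self.run_stream, name="stream execution thread")

    def start(self):
        self.query_execution_thread.start()

    def run_stream(self):
        # ... set up sources, sink, offset log, and state (elided) ...
        self.run_activated_stream()

    @abstractmethod
    def run_activated_stream(self):
        """MicroBatchExecution and ContinuousExecution each implement this:
        repeatedly turn the logical plan plus newly available offsets into
        executable batches against the sink."""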