The first streaming framework I got to know is Apache Spark, my team owns a small spark cluster which has 1 leader and 4 followers(It is said that master/slave is not good words now). We use spark streaming to read messages from Kafka and rollup metrics, then send to RabbitMQ which will be used to send out notifications.
There are 20 million questions answered each day on our platform, to better keep customers engaged and provide them better insights, it’s nice to send notifications to the survey owners about the responses they collected, this needs to be happen in a timely manner, so Apache Spark seems to be a good fit.
Spark is so popular at that time, it is the only streaming framework I know, and I even thought that is the only streaming framework ever exists. Later on, I heard more and more voices that spark is not a real streaming tool, instead a batching process. Now I understand that Apache spark is for batch processing and spark streaming is for mini-batch processing. With the arising of LAMBDA framework, spark becomes popular since it can do both batch and “streaming” process.
There are some other streaming frameworks in the market, Apache Storm, Apache Samza, Apache Flink etc. It is easy to get lost when facing too many choices, it is also true that each of them has suitable use cases. what’s important is to understand the problem we are going to solve.
Here are some facts of those frameworks:
Apache Storm
Open sourced by Twitter and was initially released in 2011.
Key concepts:
- Spouts: (data source)
- Bolts: functions
- Tuple: stream
Storm is able to process +1M msgs/second/node, A cluster can be configured to have > 1k workers, Storm provides at least once delivery
On top of its core, Storm also provides Trident API for micro batch processing,
it assures exactly once guarantee.
With Trident API, you can do aggregation, merge, join, grouping, functions, filters etc.
Storm supports data transfer protocol: Avro, thrift.
Apache Flink
Flink is originated from an academic project called Stratosphere
, it had the first release after transferring to Apache incubator at 2014.
Flink unified stream and batch processing, streaming is the nature of flink, and batch is regarded as a special case of streaming which in contrast of Spark.
Flink provides exactly once processing.
It has rich higher level API compared to Storm, also support out of order process
It is able to process 1.5M msgs/second/node
Apache Spark
Originally developed at the University of California, Berkeley’s AMPLab in 2009 and open sourced in 2010.
Spark can do exactly once delivery as well, it is working with RDD, but now suggested to use DataFrame/DataSet.
Spark is also known as hard to tune those complicate parameters for better performance.
Apache Samza
The project entered Apache Incubator in 2013 and was originally created at LinkedIn.
Samza achieves at least once delivery.
Samza is famous for its state management based on RockDB, but it’s tightly coupled with Kafka and Yarn.
It lacks some advanced features like watermarks, sessions.