Apache Pinot
Pinot is a distributed OLAP datastore built for scalable, low-latency real-time analytics.
It uses Apache Helix for cluster management, and data is stored in Pinot as segments.
Components:
Controller
Manages brokers and servers, and is responsible for assigning segments to servers.
Broker
Accepts queries from clients and returns the query results.
Server
Hosts one or more segments and responds to queries from brokers.
Data is stored as segments on servers; a segment is a columnar storage unit.
Data Ingestion
Data can be ingested in real-time or offline mode.
Real-time ingestion
Data from Kafka or another streaming source is consumed by Pinot servers directly and becomes queryable right away.
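As a rough sketch of how this is wired up (the table name, column names, Kafka address, and exact config keys are illustrative and vary across Pinot versions), a REALTIME table config pointing at a Kafka topic can be registered through the controller's REST API:

import requests

CONTROLLER = "http://localhost:9000"  # assumed Pinot controller address

# REALTIME table config: servers consume the Kafka topic and build segments from it.
realtime_table = {
    "tableName": "events",
    "tableType": "REALTIME",
    "segmentsConfig": {"schemaName": "events", "timeColumnName": "ts", "replication": "1"},
    "tableIndexConfig": {
        "loadMode": "MMAP",
        "streamConfigs": {
            "streamType": "kafka",
            "stream.kafka.topic.name": "events",
            "stream.kafka.broker.list": "localhost:9092",
            "stream.kafka.consumer.type": "lowlevel",
            "stream.kafka.decoder.class.name":
                "org.apache.pinot.plugin.stream.kafka.KafkaJSONMessageDecoder",
        },
    },
    "tenants": {},
    "metadata": {},
}

requests.post(f"{CONTROLLER}/tables", json=realtime_table).raise_for_status()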
Offline ingestion
Data sitting in offline storage is ingested through the Pinot controller, which assigns the resulting segments to Pinot servers. The typical steps (see the sketch after this list) are:
- Add Schema
- Add Table
- Create Segment
- Upload Segment
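A minimal sketch of the first two steps against the controller's REST API (the host, port, schema fields, and table name are assumptions, and endpoints can vary by Pinot version):

import requests

CONTROLLER = "http://localhost:9000"  # assumed Pinot controller address

# 1. Add schema: declares the dimensions and metrics of the table.
schema = {
    "schemaName": "events",
    "dimensionFieldSpecs": [{"name": "country", "dataType": "STRING"}],
    "metricFieldSpecs": [{"name": "clicks", "dataType": "LONG"}],
}
requests.post(f"{CONTROLLER}/schemas", json=schema).raise_for_status()

# 2. Add an OFFLINE table that references the schema.
offline_table = {
    "tableName": "events",
    "tableType": "OFFLINE",
    "segmentsConfig": {"schemaName": "events", "replication": "1"},
    "tableIndexConfig": {"loadMode": "MMAP"},
    "tenants": {},
    "metadata": {},
}
requests.post(f"{CONTROLLER}/tables", json=offline_table).raise_for_status()

# 3./4. Segments are typically built from raw files by an ingestion job
# (e.g. pinot-admin.sh CreateSegment or LaunchDataIngestionJob) and then
# uploaded to the controller, which assigns them to servers.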
Query
Pinot's query language is very similar to standard SQL, except that JOINs and nested subqueries are not supported.
SELECT <outputColumn> (, outputColumn, outputColumn,...)
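As an illustrative sketch, a query can be submitted to the broker over HTTP (the broker address, the /query/sql endpoint, and the response layout are assumptions that may vary by Pinot version):

import requests

BROKER = "http://localhost:8099"  # assumed Pinot broker address

sql = ("SELECT country, SUM(clicks) AS clicks FROM events "
       "GROUP BY country ORDER BY clicks DESC LIMIT 10")
resp = requests.post(f"{BROKER}/query/sql", json={"sql": sql})
resp.raise_for_status()
print(resp.json()["resultTable"])  # column schema plus result rows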
Indexing technology
Apache Druid
Druid is very similar to Pinot in many ways: both target real-time queries, and both support real-time and offline ingestion. Instead of Helix, Druid uses Apache ZooKeeper for coordination.
Components:
Master Server (Coordinator and Overlord processes)
Manages data availability and ingestion, similar to the Pinot controller.
Query Server (Broker and Router processes)
As in Pinot, it accepts queries from external clients; the Router routes requests to Brokers, Coordinators, and Overlords.
Data Server (Historical and MiddleManager processes)
Similar to the server in Pinot, it handles ingestion workloads and stores all queryable data.
Druid also uses a Deep Storage component as a durable backup of the data.
Data is stored as segments in Druid as well, but a Druid segment always comes with a timestamp: data is partitioned by time.
Druid supports tiering, which allows old data to be moved to nodes with more disk storage but less memory and CPU; this keeps frequently queried recent data on faster nodes and improves both cost and query efficiency.
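As a sketch of how tiering is driven (the endpoint path, tier names, datasource name, and periods are assumptions based on common Druid setups), load rules for a datasource can be posted to the coordinator:

import requests

COORDINATOR = "http://localhost:8081"  # assumed Druid coordinator address

# Keep the last 30 days on a "hot" tier with more replicas;
# everything older falls back to the default (cheaper) tier.
rules = [
    {"type": "loadByPeriod", "period": "P30D", "tieredReplicants": {"hot": 2}},
    {"type": "loadForever", "tieredReplicants": {"_default_tier": 1}},
]
requests.post(f"{COORDINATOR}/druid/coordinator/v1/rules/events",
              json=rules).raise_for_status()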
Data Ingestion
Druid also supports both batch and real-time ingestion.
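As a hedged sketch of batch ingestion (the Overlord address, datasource and column names, and the spec fields are illustrative and version-dependent), a native batch task can be submitted to the Overlord:

import requests

OVERLORD = "http://localhost:8090"  # assumed Druid Overlord address

# Native batch ingestion task: read local JSON files into the "events" datasource.
task = {
    "type": "index_parallel",
    "spec": {
        "dataSchema": {
            "dataSource": "events",
            "timestampSpec": {"column": "ts", "format": "millis"},
            "dimensionsSpec": {"dimensions": ["country", "page"]},
            "granularitySpec": {"segmentGranularity": "day", "queryGranularity": "none"},
        },
        "ioConfig": {
            "type": "index_parallel",
            "inputSource": {"type": "local", "baseDir": "/data/events", "filter": "*.json"},
            "inputFormat": {"type": "json"},
        },
        "tuningConfig": {"type": "index_parallel"},
    },
}
resp = requests.post(f"{OVERLORD}/druid/indexer/v1/task", json=task)
resp.raise_for_status()
print(resp.json())  # returns the submitted task id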
Query
Druid's native query language is JSON over HTTP; besides this, Druid also provides Druid SQL.
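For illustration (the broker address and datasource name are assumed), the same question asked through the native JSON API and through Druid SQL:

import requests

BROKER = "http://localhost:8082"  # assumed Druid broker address

# Native JSON query: daily row counts over one month.
native_query = {
    "queryType": "timeseries",
    "dataSource": "events",
    "granularity": "day",
    "intervals": ["2023-01-01/2023-02-01"],
    "aggregations": [{"type": "count", "name": "cnt"}],
}
print(requests.post(f"{BROKER}/druid/v2", json=native_query).json())

# The same aggregation expressed in Druid SQL.
sql = ("SELECT FLOOR(__time TO DAY) AS d, COUNT(*) AS cnt FROM events "
       "WHERE __time >= TIMESTAMP '2023-01-01' GROUP BY FLOOR(__time TO DAY)")
print(requests.post(f"{BROKER}/druid/v2/sql", json={"query": sql}).json())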
Indexing
Presto
Presto was designed for OLAP workloads, handling data warehousing and analytics: data analysis, aggregating large amounts of data, and producing reports. Unlike Pinot and Druid, Presto connects to and queries external data sources, ranging from HDFS to Cassandra to traditional databases like MySQL.
Coordinator
Parses statements, plans queries, and manages Presto worker nodes.
Worker
Executes tasks and processes data; worker nodes fetch data from connectors and exchange intermediate data with each other.
Data Sources
Since Presto has no data storage of its own, it relies on different kinds of connectors to get data.
Data is then modeled as catalogs, schemas, and tables in Presto.
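A minimal sketch of issuing a query through Presto's HTTP protocol (the coordinator address, user, and the hive.web.page_views table are assumptions; the table is addressed as catalog.schema.table):

import requests

COORDINATOR = "http://localhost:8080"  # assumed Presto coordinator address
HEADERS = {"X-Presto-User": "analyst"}  # Presto requires a user header

sql = "SELECT country, count(*) FROM hive.web.page_views GROUP BY country"

resp = requests.post(f"{COORDINATOR}/v1/statement", data=sql, headers=HEADERS).json()
rows = []
# The coordinator returns results incrementally; follow nextUri until the query finishes.
while resp.get("nextUri"):
    resp = requests.get(resp["nextUri"], headers=HEADERS).json()
    rows.extend(resp.get("data", []))
print(rows)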
Query Execution
Statement -> Queries -> Stages -> Tasks
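To make this decomposition visible, EXPLAIN (sketched below with the same assumed coordinator and table as above) prints the plan the coordinator builds: the plan is split into fragments, each fragment runs as a stage, and each stage is executed as parallel tasks on the workers.

import requests

COORDINATOR = "http://localhost:8080"  # assumed Presto coordinator address
HEADERS = {"X-Presto-User": "analyst"}

resp = requests.post(f"{COORDINATOR}/v1/statement",
                     data="EXPLAIN SELECT count(*) FROM hive.web.page_views",
                     headers=HEADERS).json()
while resp.get("nextUri"):
    resp = requests.get(resp["nextUri"], headers=HEADERS).json()
    for row in resp.get("data", []):
        print(row[0])  # textual plan, split into fragments (= stages)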
ClickHouse
Extra Reads
https://medium.com/@leventov/design-of-a-cost-efficient-time-series-store-for-big-data-88c5dc41af8e