Designing Data-Intensive Applications (6) - Data Partition

These are my personal learning notes from reading Designing Data-Intensive Applications, Chapter 6.

What is data partition?

Partitioning means that instead of keeping all the data in one place, we divide it into multiple pieces, and each piece is stored in its own partition.

The purpose of data partition:

  • Scalability

What makes partitioning difficult?

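  • How to assign data to partitions? We want to avoid skewed partitions (some partitions holding more data or receiving more queries than others) and hot spots (a single partition with disproportionately high load).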

Solution

Partitioning by key range

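Each partition is assigned a contiguous range of keys, but certain access patterns can still lead to hot spots. For example, if the key is a timestamp, all writes for the current period land on the same partition. To resolve this, prefix the timestamp with another value (such as a sensor name) so that writes are spread across partitions.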

Key-range partitioning suits range queries well, because keys are kept sorted within each partition and the range boundaries tell us which partition holds a given key.
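
The lookup described above can be sketched as a binary search over sorted partition boundaries. This is only an illustration: the boundary values and the function names `partition_for` and `partitions_for_range` are made up for this note, not taken from any particular database.

```python
import bisect

# Hypothetical boundaries: partition 0 holds keys < "g", partition 1 holds
# keys in ["g", "n"), partition 2 holds ["n", "t"), partition 3 the rest.
BOUNDARIES = ["g", "n", "t"]


def partition_for(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(BOUNDARIES, key)


def partitions_for_range(start: str, end: str) -> range:
    """Return the partitions a range scan over [start, end] has to visit."""
    return range(partition_for(start), partition_for(end) + 1)


if __name__ == "__main__":
    print(partition_for("apple"))
    print(list(partitions_for_range("h", "p")))
```

Because neighboring keys live in the same or adjacent partitions, a range scan only touches a few partitions instead of all of them.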

Partitioning by hash of key

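A good hash function spreads even skewed keys evenly across partitions, at the cost of efficient range queries, since adjacent keys end up scattered. See also consistent hashing, which was popularized for databases by Amazon's Dynamo; Cassandra partitions data in a similar way.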
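
A minimal sketch of hash partitioning, assuming a fixed number of partitions that each own an equal slice of a 32-bit hash space. The partition count and the names `stable_hash` and `hash_partition_for` are assumptions for illustration. MD5 is used only because it gives the same result in every process (Python's built-in `hash()` is randomized for strings), not for security.

```python
import hashlib

NUM_PARTITIONS = 8  # assumed value for illustration


def stable_hash(key: str) -> int:
    """Hash a key to a 32-bit integer that is identical in every process."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def hash_partition_for(key: str) -> int:
    """Map the hash onto one of NUM_PARTITIONS equal slices of the hash space."""
    return stable_hash(key) * NUM_PARTITIONS // 2**32


if __name__ == "__main__":
    # Consecutive timestamps no longer pile up on one partition.
    for second in range(5):
        key = f"2024-01-01T00:00:0{second}"
        print(key, "->", hash_partition_for(key))
```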

Rebalancing partitions

This probably falls into the DevOps world. Rebalancing partitions is needed when:

  • A machine fails, and its data must be moved to other nodes
  • You want to add more nodes to spread the load because query throughput has increased

To achieve this you can:

  • Use a fixed number of partitions, with many more partitions than nodes. When rebalancing, you only need to move entire partitions between nodes; the assignment of keys to partitions is not affected.

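  • Dynamic partitioning: a partition is split when it grows past a configured size and merged with a neighbor when it shrinks, so the number of partitions adapts to the data volume. This is well suited to key-range partitioning.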

  • Partitioning proportionally to nodes: each node holds a fixed number of partitions. When a new node joins the cluster, it randomly chooses a fixed number of existing partitions to split, and then takes ownership of one half of each of those split partitions while leaving the other half in place. This is basically what Cassandra does. It works because hash-based partitioning allows split boundaries to be picked at random.
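
The fixed-number-of-partitions option from the list above can be sketched as follows. The node names, the partition count of 12, and the `add_node` helper are assumptions made for illustration; a real system would also have to copy the data while continuing to serve requests.

```python
from collections import Counter


def add_node(assignment: dict[int, str], new_node: str) -> dict[int, str]:
    """Return a new partition -> node map after `new_node` joins.

    The new node takes whole partitions from the most loaded nodes until it
    holds its fair share. No key ever changes partition.
    """
    result = dict(assignment)
    load = Counter(result.values())  # partitions per existing node
    fair_share = len(result) // (len(load) + 1)
    for _ in range(fair_share):
        donor = max(load, key=lambda node: load[node])
        partition = next(p for p, n in sorted(result.items()) if n == donor)
        result[partition] = new_node
        load[donor] -= 1
    return result


if __name__ == "__main__":
    before = {p: f"node-{p % 3}" for p in range(12)}  # 12 partitions, 3 nodes
    after = add_node(before, "node-3")
    moved = [p for p in before if before[p] != after[p]]
    print("partitions moved:", moved)
    print("load after:", Counter(after.values()))
```

Only the partitions handed to the new node are moved; every other partition stays where it was, which is the point of keeping the partition count independent of the node count.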

Request Routing

Routing requires the cluster to know which partition, and therefore which node, holds a given key.
A central coordination service such as ZooKeeper can be used to track this mapping.

Some databases take a decentralized approach instead: every node keeps a copy of the cluster information, so a request can be sent to any node. Cassandra uses a gossip protocol to keep every node up to date.
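
Whichever approach is used, the routing step itself boils down to two lookups: key to partition, then partition to node. A rough sketch, assuming hash partitioning with four partitions; the addresses, the `ROUTING_TABLE` contents, and the function names are hypothetical, and how the table is kept current (coordination service or gossip) is left out.

```python
import hashlib

NUM_PARTITIONS = 4  # assumed for illustration

# Hypothetical partition -> node address map. In a real cluster this would be
# kept current by a coordination service or by gossip between nodes.
ROUTING_TABLE = {0: "10.0.0.1", 1: "10.0.0.2", 2: "10.0.0.1", 3: "10.0.0.3"}


def partition_for(key: str) -> int:
    """Map a key to a partition with a process-independent hash."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def route(key: str, table: dict[int, str] = ROUTING_TABLE) -> str:
    """Return the address of the node that currently owns `key`."""
    return table[partition_for(key)]


if __name__ == "__main__":
    print("user:42 is served by", route("user:42"))
```

After a rebalance only the table changes; the key-to-partition function stays the same.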