Database Partitioning

Oct 18, 2020

A summary of database partitioning/sharding strategies. Partitioning and sharding are interchangeable terms.

Partitioning means splitting up the dataset into multiple partitions so that each row/document belongs to exactly one partition. The motivation is to spread the workload (CPU, memory, disk) across partitions so that the database is scalable (i.e. able to add more resources to cope with higher load without sacrificing performance).

There are two main ways of achieving partitioning: key range based partitioning and hash based partitioning.

Key Range Based Partitioning

In this scheme, partitioning is based on sorted key ranges. For example, a dictionary could be broken into 5 partitions based on key ranges: A-E, F-J, K-O, P-T, U-Z. Within each partition, keys are kept sorted so that range scans are efficient. Note that partitions may not be evenly sized; for example, there are far more words starting with A-E than with U-Z.

The partition boundaries are chosen by the database administrator (you), or they can be chosen automatically (dynamic partitioning).
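
As a rough illustration (not tied to any particular database), picking a partition in this scheme boils down to a lookup against a sorted list of boundaries. The boundaries and partition names below are made up for the dictionary example:

import bisect

# Upper (inclusive) boundary of each key-range partition, kept sorted.
BOUNDARIES = ["E", "J", "O", "T", "Z"]
PARTITIONS = ["p0 (A-E)", "p1 (F-J)", "p2 (K-O)", "p3 (P-T)", "p4 (U-Z)"]

def partition_for(key: str) -> str:
    """Return the partition whose key range contains the key's first letter."""
    first = key[0].upper()
    return PARTITIONS[bisect.bisect_left(BOUNDARIES, first)]

print(partition_for("apple"))  # p0 (A-E)
print(partition_for("zebra"))  # p4 (U-Z)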

This partitioning strategy is used by HBase, RethinkDB and MongoDB.

A downside to this strategy is that there can be “hot spots”: some partitions get queried much more than the rest. To alleviate the problem, we can mutate the hot keys so they are spread across multiple partitions. For example, if user id “xxx” is a hot key, store it as “A-xxx”, “C-xxx”, “E-xxx”. However, this also adds query overhead, because to fetch the data related to user “xxx” we now need to query multiple partitions.
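
A minimal sketch of this key-mutation ("salting") idea, assuming we prefix the hot key with one of a small set of salts on write and fan out across all salts on read. The salt values and helper names are invented for illustration:

import random

SALTS = ["A", "C", "E"]  # hypothetical prefixes used to spread one hot key

def write_key(hot_key: str) -> str:
    """Spread writes for a hot key across several partitions by salting it."""
    return f"{random.choice(SALTS)}-{hot_key}"

def read_keys(hot_key: str) -> list[str]:
    """Reads now have to fan out to every salted variant of the key."""
    return [f"{salt}-{hot_key}" for salt in SALTS]

print(write_key("xxx"))  # e.g. "C-xxx", stored wherever that key range lives
print(read_keys("xxx"))  # ['A-xxx', 'C-xxx', 'E-xxx'] -- query all of them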

Hash Based Partitioning

In this scheme, partitioning is based on hash(key) instead of the key itself: a key is stored in a partition if hash(key) falls into the range assigned to that partition. With a good, uniform hash function, the distribution can be quite even, which addresses the downside of key range based partitioning. The downside of this strategy is that keys are no longer sorted, so range scans are not efficient. However, databases like Cassandra achieve a compromise: a table in Cassandra can be declared with a compound primary key consisting of several columns; only the first part of the compound primary key is used to identify the partition, and the rest is used to sort the rows within it. For example, if you design a Twitter app and use (user_id, tweet_timestamp) as the key, then user_id is used to identify the partition and tweet_timestamp can be used for efficient range scan queries (e.g. all tweets posted by user 123 between Dec 2019 and June 2020).
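
A toy sketch of that compromise (not Cassandra's actual storage engine): only the first column of the compound key, user_id, is hashed to pick a partition, while rows inside the partition stay sorted by tweet_timestamp so a time-range scan touches one partition and one contiguous slice. The partition count and layout are invented:

import bisect
import hashlib

NUM_PARTITIONS = 4
partitions = [[] for _ in range(NUM_PARTITIONS)]  # rows kept sorted per partition

def partition_of(user_id: str) -> int:
    """Only the partition key (user_id) is hashed to pick a partition."""
    return int(hashlib.md5(user_id.encode()).hexdigest(), 16) % NUM_PARTITIONS

def insert(user_id: str, tweet_timestamp: int, text: str) -> None:
    rows = partitions[partition_of(user_id)]
    bisect.insort(rows, (user_id, tweet_timestamp, text))  # sorted by timestamp

def tweets_between(user_id: str, start: int, end: int) -> list[tuple]:
    """Range scan: one partition, one contiguous slice of sorted rows."""
    rows = partitions[partition_of(user_id)]
    lo = bisect.bisect_left(rows, (user_id, start, ""))
    hi = bisect.bisect_right(rows, (user_id, end, chr(0x10FFFF)))
    return rows[lo:hi]

insert("123", 1577836800, "happy new year")            # Jan 2020
insert("123", 1583020800, "spring tweet")              # Mar 2020
print(tweets_between("123", 1575158400, 1593561600))   # ~Dec 2019 to ~Jun 2020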

Partitioning and Secondary Index

The strategies above only involve the primary key, which uses the primary index. What if we have a secondary index? How does it interact with partitioning?

There are two main ways to deal with secondary indexes: document-based and term-based.

Document-based partitioning with secondary index

In this approach, each partition stores a secondary index covering only the rows/documents stored in that partition. In other words, the secondary index in a partition knows nothing about rows/documents in other partitions. For this reason, such an index is also called a local index.

A downside to this method is that if a user makes a query like this:

select * from CARS where color='red'

all partitions need to be queried in order to fetch all the red cars! This scatter/gather across partitions is slow.

On the plus side, a write to a row/document only needs to update the index in the single partition that stores it.
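
A rough sketch of such a local (document-partitioned) index, with an invented in-memory layout: each partition indexes only its own rows, so a write touches a single partition but the color query has to scatter across all partitions and gather the results:

from collections import defaultdict

class Partition:
    def __init__(self):
        self.rows = {}                       # primary key -> row
        self.color_index = defaultdict(set)  # local secondary index: color -> keys

    def put(self, car_id: str, row: dict) -> None:
        """A write only touches this partition's rows and its local index."""
        self.rows[car_id] = row
        self.color_index[row["color"]].add(car_id)

partitions = [Partition() for _ in range(3)]

def put_car(car_id: str, row: dict) -> None:
    partitions[hash(car_id) % len(partitions)].put(car_id, row)

def cars_by_color(color: str) -> list[dict]:
    """select * from CARS where color='red' has to ask every partition."""
    result = []
    for p in partitions:  # scatter ...
        result += [p.rows[k] for k in p.color_index.get(color, ())]
    return result         # ... and gather

put_car("c1", {"color": "red", "model": "civic"})
put_car("c2", {"color": "red", "model": "corolla"})
print(cars_by_color("red"))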

Term-based partitioning with secondary index

Since the document-based method is slow for reads (every partition has to be queried), how about having the secondary index know about rows/documents in all partitions? In that case the index would be too big for a single node, which defeats the whole purpose of partitioning. Instead, we split the secondary index's responsibility across the partitions by term. For example, with 2 partitions, partition 1 only indexes rows/documents with colors red through brown, and partition 2 only indexes colors green through black. Thus, when we need to make the query

select * from CARS where color='red'

we know partition 1 has the answers!

There is a catch on the write side, though: because index entries are partitioned by term, they may live on partitions other than the one storing the row, so a single write to a row/document may need to update secondary indexes on several partitions. The trade-off compared to the document-based approach is faster reads in exchange for slower, more complicated writes.
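
A toy sketch of a term-partitioned index with made-up term ranges: all index entries for a given color live on one partition, so the read goes straight there, but a write may have to update an index entry on a different partition than the one storing the row:

from collections import defaultdict

NUM_PARTITIONS = 2
rows = [dict() for _ in range(NUM_PARTITIONS)]                   # row storage
color_index = [defaultdict(set) for _ in range(NUM_PARTITIONS)]  # index, split by term

def row_partition(car_id: str) -> int:
    return hash(car_id) % NUM_PARTITIONS

def term_partition(color: str) -> int:
    """Made-up rule: colors a-m are indexed on partition 0, n-z on partition 1."""
    return 0 if color[0].lower() <= "m" else 1

def put_car(car_id: str, row: dict) -> None:
    rows[row_partition(car_id)][car_id] = row
    # The index entry may live on a different partition than the row itself,
    # which is why writes are more involved under term-based partitioning.
    color_index[term_partition(row["color"])][row["color"]].add(car_id)

def cars_by_color(color: str) -> list[str]:
    """The query only asks the single partition that owns this term."""
    return sorted(color_index[term_partition(color)][color])

put_car("c1", {"color": "red", "model": "civic"})
put_car("c2", {"color": "red", "model": "corolla"})
print(cars_by_color("red"))  # ['c1', 'c2'], served from one index partition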

Rebalancing partitions

As we add/remove nodes to handle requests, we need to rebalance so that each node ends up with a fair share of the data and query load.

Consistent Hashing

With hash based partitioning, a naive approach to selecting a partition for a given hash(key) is:

1 Divide the space of possible hashes into N partitions, one per node

2 selected partition = hash(key) % N

Note that with this approach, each node handles exactly one partition.

The problem with this approach is that when N changes as we add/remove nodes, the selected partition for most keys changes. In order to keep serving queries correctly, we then have to move data between partitions. This is inefficient: if most of the data in our database has to move whenever a node is added/removed, availability drops and performance suffers because the data migration eats up the I/O bandwidth. To solve this problem, we need Consistent Hashing. See my previous blog post on that: https://uofiszhou42.medium.com/consistent-hashing-explained-8a64ae7e3993
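
Before looking at the fix, here is a quick, made-up check of how disruptive hash(key) % N is: when N goes from 10 to 11, the vast majority of keys land on a different partition and would have to move.

import hashlib

def bucket(key: str, n: int) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n

keys = [f"user-{i}" for i in range(10_000)]
moved = sum(1 for k in keys if bucket(k, 10) != bucket(k, 11))
print(f"{moved / len(keys):.0%} of keys change partition when N goes 10 -> 11")
# Typically around 90% of the keys move, i.e. almost everything is reshuffled.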

With Consistent Hashing, adding or removing a node only moves a small fraction of the keys, so the partitions stay balanced in spite of operational changes.
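
For reference, a minimal consistent-hashing ring (no virtual nodes, just to show the shape of the idea; see the linked post for the full treatment): each node is hashed onto a ring and a key belongs to the first node clockwise from it, so adding or removing one node only remaps the keys on the neighboring arc.

import bisect
import hashlib

def h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Minimal consistent-hashing ring (no virtual nodes, for brevity)."""
    def __init__(self, nodes):
        self.ring = sorted((h(n), n) for n in nodes)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self.ring, (h(key), chr(0x10FFFF)))
        return self.ring[idx % len(self.ring)][1]  # wrap around the ring

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user-123"))
# Adding "node-d" would only take over the keys on the arc just before it;
# keys owned by the other nodes stay where they are.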

Fixed number of partitions

Another approach for achieving balanced partitions is to have many more partitions than nodes, e.g. M = 1000 partitions and N = 10 nodes, so each node handles 100 partitions.

When adding a new node, it steals some partitions from the existing nodes.

When removing a node, its partitions are reassigned to the remaining nodes.

In this case, when a node is added/removed, only certain whole partitions need to move between nodes; the key-to-partition mapping itself never changes.

A practical note: choose the number of partitions to be just right. If it is too small, each partition is large and transferring partitions when nodes get added/removed takes too long; if it is too large, it incurs more overhead (e.g. more metadata to store for the node-to-partition mapping and more per-partition bookkeeping).
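
A sketch of the fixed-number-of-partitions idea with made-up counts: keys map to one of M partitions forever, only the partition-to-node assignment changes when a node joins, and only whole partitions move.

import hashlib

M = 16  # fixed number of partitions (in practice something like 1000)

def partition_of(key: str) -> int:
    """The key -> partition mapping never changes, regardless of node count."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % M

nodes = ["n1", "n2", "n3", "n4"]
owner = {p: nodes[p % len(nodes)] for p in range(M)}  # partition -> node

def add_node(new_node: str) -> list[int]:
    """A new node steals a few partitions from the existing nodes."""
    per_node = M // (len(nodes) + 1)
    stolen = list(owner)[:per_node]  # naive choice of which partitions to steal
    for p in stolen:
        owner[p] = new_node          # only these partitions are copied over
    nodes.append(new_node)
    return stolen

print(f"partitions copied to n5: {add_node('n5')}")  # everything else stays put
print(f"key 'user-42' still maps to partition {partition_of('user-42')}")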

Request Routing

When a client wants to make a request, how does it know which node to connect to? As partitions are rebalanced, the assignment of partitions to nodes changes.

With consistent hashing, this is straightforward: some app server can take care of it, and only the key is needed to find the right partition.

With the fixed #partitions approach, there is an extra layer of indirection: the partition-to-node mapping. There are 3 approaches:

1 Allow the client to connect to any node, and have that node forward the request to the node that owns the key

2 Have a routing tier that routes each request to the correct node

3 Make clients aware of which node to connect to (this probably requires some API to inform the clients, or some sort of fixed scheme; complicated and not very scalable)

Approach 2 is probably the best.

With approaches 1 and 2, we still need some way of knowing the partition-to-node mapping; this is usually stored in ZooKeeper or a similar coordination service.
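
A sketch of approach 2 with a hypothetical in-process dict standing in for ZooKeeper: the routing tier watches the partition-to-node assignment and forwards each request to whichever node currently owns the key's partition.

import hashlib

M = 16  # fixed number of partitions

# In reality this mapping lives in ZooKeeper (or similar) and the routing tier
# subscribes to changes; here it is just a dict for illustration.
partition_to_node = {p: f"node-{p % 4}" for p in range(M)}

def partition_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % M

def route(key: str) -> str:
    """Routing tier: resolve key -> partition -> node, then forward the request."""
    return partition_to_node[partition_of(key)]

print(route("user-123"))  # e.g. "node-2", the node this request is forwarded to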
