Database Storage Engine Internals Summary

4 min readOct 15, 2020

Just short summary of main types of database storage engines, not intended as a full article to explain everything

B-Tree

B-Tree has been the standard index implementation behind most relational databases like MySql(InnoDB), PostgreSql, some non-relational database like CouchBase and earlier version of MongoDB. So B-Tree can be used in both Sql and NoSql databases.

What is shown in Figure 3–6 is a B-tree index/primary index/clustered index that properly stores the primary keys of a table and the tree points you to a leaf page containing the key you’re looking for. Unlike a binary tree which only has 2 children, B-tree can have hundreds of children(branching factors), so tree will be flat, enabling fast search/insert. As B-tree nodes grow/removed, the tree stays balanced so tree remain flat and have height = O(logN)

For secondary index, there’s another B-tree, whose leaf node doesn’t point to a physical page, but stores the primary key which is then used to find the row in the clustered index

What if there’s a crash in the middle of updating 2 pages? it’s possible for it to end up in an inconsistent state. Usually we write every B-tree operation to a log file and recover from it.

LSM Trees/SSTables

SSTables(Sorted String Table) approach is easier to understand in terms of data structure.

The above picture is essentially how we store data in SSTables, a bunch of segment files each containing a list of sorted key-value pairs.

Upon writing a key value pair, the key-value pair is appended to a segment file. After a segment file reaches a certain size, create a new segment file and append to it instead. As segment grows, a background process will merge neighboring segments to remove duplicates and avoid data from growing too large.

Note that keys in segment file are sorted. They’re sorted because we don’t simply append each key-value pair in the order they appear. Instead, they’ll go through a balanced tree layer which sorts them before appending to segment file, see memtable below for more details.

Upon reading a key, the most recent segment is used to find the key, if it doesn’t exist, find the next segment and so on.

Actually, It’d be slow if that’s how we find a key. Instead, we maintain an index in memory. The index tells you where your key is(sort of). For example, if the target key is handicap, then we know which offset ranges of which segment file it’s in. Then, to find the key, simply scan that range.

So that’s how SSTable works, but how is the whole data structure constructed in the first place? In another words, how do we keep strings in segment file sorted?

Imagine we keep in memory some sort of balanced tree(AVL or Red Black Tree), which we call memtable. Write requests are always served by the memtable. When the memtable grows to a certain size, it’s written to segment files in sorted order, this is an O(nlogn) operation in a balanced tree.

Read requests are also served by memtable first, if it doesn’t exist, then at least we can get a small offset range into the segment files and scan from there.

Advantages of LSM trees: range queries is efficient since sorted keys are close together. Write throughput is big as disk write pattern is sequential instead of random, in-place writes in B-tree.

Database that uses LSM:Apache Cassandra, Elasticsearch (Lucene), Google Bigtable, Apache HBase, and InfluxDB.

B-Tree VS LSM Trees/SSTables

Read/Write speed

LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads. Reads are typically slower on LSM-trees because they have to check several different data structures and SSTables at different stages of compaction. Although bloom filters are used to mitigate this(to reduce the number of files to be checked during a point query).

B-tree writes are slower, because it needs to write to several pages for a write(one for the actual data and one for page-split, or maybe writing to write-ahead logs)

LSM-Tree’s compaction process can interfere with read/write performance at high percentiles(p75 is probably fine though).

Summary:

For high write-heavy application, go for LSM Trees as their sequential write pattern allows for higher write throughput.

For faster reads, go for B-tree, it’s O(logn) to find stuff. No need to scan through data.