DDIA Chapter 3 - Storage and Retrieval

By Guanzhou Song

On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you.

Hash Indexes

Key-value stores are quite similar to the dictionary type that you can find in most programming languages.

An append-only log seems wasteful at first glance: why don’t you update the file in place, overwriting the old value with the new value? But an append-only design turns out to be good for several reasons:

  • Appending and segment merging are sequential write operations, which are generally much faster than random writes.

  • Concurrency and crash recovery are much simpler if segment files are append-only or immutable.

  • Merging old segments avoids the problem of data files getting fragmented over time.

However, the hash table index also has limitations:

  • The hash table must fit in memory, so if you have a very large number of keys, you’re out of luck.

  • Range queries are not efficient.
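
To make the idea concrete, here is a minimal Python sketch of such a hash-indexed, append-only store (the `LogKV` name and the comma-separated record format are illustrative assumptions, not any particular engine's format): every write appends a record to the data file and stores its byte offset in an in-memory dict, and every read seeks straight to that offset.

```python
import os

class LogKV:
    """Toy append-only key-value store with an in-memory hash index
    mapping each key to the byte offset of its most recent record."""

    def __init__(self, path):
        self.index = {}                  # key -> byte offset of latest value
        self.f = open(path, "a+b")       # append-only data file

    def put(self, key, value):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        # Naive one-record-per-line format; real stores use length-prefixed binary records.
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset         # newest offset wins; old records become garbage

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        self.f.seek(offset)
        _, value = self.f.readline().decode().rstrip("\n").split(",", 1)
        return value

store = LogKV("data.log")
store.put("42", "hello")
store.put("42", "world")                 # the old record stays until compaction
print(store.get("42"))                   # -> world
```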

SSTables and LSM-Trees

We require that the sequence of key-value pairs is sorted by key and call this format Sorted String Table, or SSTable for short. We also require that each key only appears once within each merged segment file (the compaction process already ensures that).

SSTables have several big advantages over log segments with hash indexes:

  • Merging segments is simple and efficient, even if the files are bigger than the available memory, because it works like the merge step of mergesort (see the sketch after this list).

  • In order to find a particular key in the file, you no longer need to keep an index of all the keys in memory; a sparse in-memory index is enough.

  • Since read requests need to scan over several key-value pairs in the requested range anyway, it is possible to group those records into a block and compress it before writing it to disk.
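
A minimal Python sketch of that mergesort-style compaction, with lists standing in for on-disk segment files (the function name and the convention that newer segments are passed last are assumptions for illustration):

```python
import heapq

def _tag(segment, age):
    # Tag each entry so that, for equal keys, entries from newer
    # segments (larger age) sort first in the merge.
    for key, value in segment:
        yield (key, -age, value)

def merge_segments(*segments):
    """Mergesort-style compaction of sorted (key, value) segments.
    Segments are given oldest first; the newest value for each key wins.
    heapq.merge streams its inputs, so the segments never need to fit
    in memory all at once."""
    streams = [_tag(seg, age) for age, seg in enumerate(segments)]
    result = []
    last_key = object()                  # sentinel that never equals a real key
    for key, _, value in heapq.merge(*streams):
        if key != last_key:              # first occurrence is the newest value
            result.append((key, value))
            last_key = key
    return result

old = [("apple", 1), ("banana", 2), ("cherry", 3)]
new = [("banana", 20), ("date", 4)]
print(merge_segments(old, new))
# [('apple', 1), ('banana', 20), ('cherry', 3), ('date', 4)]
```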

Constructing and maintaining SSTables

Maintaining a sorted structure on disk is possible (B-Trees), but maintaining it in memory is much easier.

There are plenty of well-known tree data structures that you can use, such as red-black trees or AVL trees.

With these data structures, you can insert keys in any order and read them back in sorted order.

We can now make our storage engine work as follows:

  • When a write comes in, add it to an in-memory balanced tree data structure (for example, a red-black tree). This in-memory tree is sometimes called a memtable.

  • When the memtable gets bigger than some threshold (typically a few megabytes), write it out to disk as an SSTable file. This can be done efficiently because the tree already maintains the key-value pairs sorted by key.

  • The new SSTable file becomes the most recent segment of the database. While the SSTable is being written out to disk, writes can continue to a new memtable instance.

  • In order to serve a read request, first try to find the key in the memtable, then in the most recent on-disk segment, then in the next-older segment, etc.

  • From time to time, run a merging and compaction process in the background to combine segment files and to discard overwritten or deleted values.
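
A toy Python version of this workflow, with a plain dict standing in for the balanced-tree memtable and in-memory sorted lists standing in for SSTable files (`TinyLSM` and its parameters are hypothetical; this is a sketch, not how any real engine is implemented):

```python
import bisect

class TinyLSM:
    """Toy LSM-tree: an in-memory memtable plus a list of sorted,
    immutable segments (newest last)."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}             # stands in for a red-black tree
        self.segments = []             # each segment is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Write the memtable out as a new sorted segment and start a fresh one.
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Check the memtable first, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            keys = [k for k, _ in segment]
            i = bisect.bisect_left(keys, key)
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None

db = TinyLSM(memtable_limit=2)
db.put("a", 1)
db.put("b", 2)                         # memtable full: flushed to a segment
db.put("a", 10)                        # newer value lives in the memtable
print(db.get("a"), db.get("b"), db.get("z"))   # 10 2 None
```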

Log-Structured Merge-Tree (or LSM-Tree)

Lucene, an indexing engine for full-text search used by Elasticsearch and Solr, uses a similar method for storing its term dictionary.

A full-text index is much more complex than a key-value index but is based on a similar idea:

given a word in a search query, find all the documents (web pages, product descriptions, etc.) that mention the word.

This is implemented with a key-value structure where the key is a word (a term) and the value is the list of IDs of all the documents that contain the word (the postings list).

In Lucene, this mapping from term to postings list is kept in SSTable-like sorted files, which are merged in the background as needed.
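
A minimal sketch of that key-value structure in Python (the `build_index` helper and the whitespace tokenization are illustrative assumptions, not Lucene's actual API or analysis pipeline):

```python
from collections import defaultdict

def build_index(documents):
    """Map each term to the sorted list of IDs of documents containing it
    (the postings list). `documents` is {doc_id: text}."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # In Lucene this term -> postings mapping is kept in SSTable-like
    # sorted files; here we just return a sorted dict.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {
    1: "sweet red apple",
    2: "red wine",
    3: "apple pie",
}
index = build_index(docs)
print(index["apple"])   # [1, 3]
print(index["red"])     # [1, 2]
```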

This is the basic idea of LSM-trees: keeping a cascade of SSTables that are merged in the background.

B-Trees

The most widely used indexing structure is quite different: the B-tree.

B-trees break the database down into fixed-size blocks or pages, traditionally 4 KB in size (sometimes bigger).

Each page can be identified using an address or location, which allows one page to refer to another—similar to a pointer.

The number of references to child pages in one page of the B-tree is called the branching factor.

In practice, the branching factor depends on the amount of space required to store the page references and the range boundaries, but typically it is several hundred.

The algorithm ensures that the tree remains balanced: a B-tree with n keys always has a depth of O(log n).

Most databases can fit into a B-tree that is three or four levels deep, so you don’t need to follow many page references to find the page you are looking for.
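
A quick back-of-the-envelope calculation shows why so few levels suffice, assuming a branching factor of 500 and 4 KB pages (illustrative values consistent with the "several hundred" figure above):

```python
# Rough capacity of a B-tree: with branching factor b and depth d,
# the leaf level holds about b**d pages.
branching_factor = 500
page_size_bytes = 4 * 1024

leaf_pages = branching_factor ** 4            # a four-level tree
capacity_bytes = leaf_pages * page_size_bytes
print(capacity_bytes / 1e12, "TB")            # -> 256.0 TB
```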

Making B-trees reliable

The basic underlying write operation of a B-tree is to overwrite a page on disk with new data.

It is assumed that the overwrite does not change the location of the page;

i.e., all references to that page remain intact when the page is overwritten.

This is in stark contrast to log-structured indexes such as LSM-trees, which only append to files (and eventually delete obsolete files) but never modify files in place.

You can think of overwriting a page on disk as an actual hardware operation.

On a magnetic hard drive, this means moving the disk head to the right place, waiting for the right position on the spinning platter to come around, and then overwriting the appropriate sector with new data.

On SSDs, what happens is somewhat more complicated, due to the fact that an SSD must erase and rewrite fairly large blocks of a storage chip at a time.

Moreover, some operations require several different pages to be overwritten.

For example, if you split a page because an insertion caused it to be overfull, you need to write the two pages that were split, and also overwrite their parent page to update the references to the two child pages.

This is a dangerous operation, because if the database crashes after only some of the pages have been written, you end up with a corrupted index (e.g., there may be an orphan page that is not a child of any parent).

In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a write-ahead log (WAL, also known as a redo log).

This is an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself.

When the database comes back up after a crash, this log is used to restore the B-tree back to a consistent state.
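
A minimal Python sketch of the WAL ordering (log first, apply second, replay on restart); the `WalStore` class, the JSON record format, and the dict standing in for B-tree pages are all assumptions for illustration. A real B-tree logs page-level changes, but the ordering is the same:

```python
import json
import os

class WalStore:
    """Toy write-ahead log: every change is appended (and fsynced) to the log
    before the in-memory "pages" are modified, so a crash between the two
    steps can be repaired by replaying the log on startup."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.pages = {}                 # stands in for the B-tree pages
        self._replay()

    def _replay(self):
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path) as f:
            for line in f:
                change = json.loads(line)
                self.pages[change["key"]] = change["value"]

    def put(self, key, value):
        # 1. Append the change to the WAL and force it to disk.
        with open(self.wal_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then apply the change to the tree itself.
        self.pages[key] = value

store = WalStore("btree.wal")
store.put("user:1", "Alice")
```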

An additional complication of updating pages in place is that careful concurrency control is required if multiple threads are going to access the B-tree at the same time — otherwise a thread may see the tree in an inconsistent state.

This is typically done by protecting the tree’s data structures with latches (lightweight locks).

Log-structured approaches are simpler in this regard, because they do all the merging in the background without interfering with incoming queries and atomically swap old segments for new segments from time to time.

B-tree optimizations

As B-trees have been around for so long, it’s not surprising that many optimizations have been developed over the years.

To mention just a few:

  • Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) use a copy-on-write scheme. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control.

  • We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages on the interior of the tree, keys only need to provide enough information to act as boundaries between key ranges. Packing more keys into a page allows the tree to have a higher branching factor, and thus fewer levels.

  • In general, pages can be positioned anywhere on disk; there is nothing requiring pages with nearby key ranges to be nearby on disk. If a query needs to scan over a large part of the key range in sorted order, that page-by-page layout can be inefficient, because a disk seek may be required for every page that is read. Many B-tree implementations therefore try to lay out the tree so that leaf pages appear in sequential order on disk. However, it’s difficult to maintain that order as the tree grows. By contrast, since LSM-trees rewrite large segments of the storage in one go during merging, it’s easier for them to keep sequential keys close to each other on disk.

  • Additional pointers have been added to the tree. For example, each leaf page may have references to its sibling pages to the left and right, which allows scanning keys in order without jumping back to parent pages.

  • B-tree variants such as fractal trees borrow some log-structured ideas to reduce disk seeks (and they have nothing to do with fractals).

Transaction Processing or Analytics

OLTP: An application typically looks up a small number of records by some key, using an index. Records are inserted or updated based on the user’s input. Because these applications are interactive, the access pattern became known as online transaction processing (OLTP).

OLAP: Queries are typically written by business analysts and feed into reports that help the management of a company make better decisions (business intelligence). To distinguish this pattern of database use from transaction processing, it has been called online analytic processing (OLAP).

A rough comparison of the two:

  • Main read pattern: OLTP reads a small number of records per query, fetched by key; OLAP aggregates over a large number of records.

  • Main write pattern: OLTP sees random-access, low-latency writes from user input; OLAP sees bulk imports (ETL) or an event stream.

  • Primarily used by: OLTP serves end users/customers via a web application; OLAP serves internal analysts for decision support.

  • What the data represents: OLTP holds the latest state of the data (current point in time); OLAP holds a history of events that happened over time.

  • Dataset size: OLTP ranges from gigabytes to terabytes; OLAP from terabytes to petabytes.

Data Warehousing

An enterprise may have dozens of different transaction processing systems: systems powering the customer-facing website, controlling point of sale (checkout) systems in physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, administering employees, etc.

Each of these systems is complex and needs a team of people to maintain it, so the systems end up operating mostly autonomously from each other.

These OLTP systems are usually expected to be highly available and to process transactions with low latency, since they are often critical to the operation of the business.

The data warehouse contains a read-only copy of the data in all the various OLTP systems in the company.

The data model of a data warehouse is most commonly relational, because SQL is generally a good fit for analytic queries.

On the surface, a data warehouse and a relational OLTP database look similar, because they both have a SQL query interface.

However, the internals of the systems can look quite different, because they are optimized for very different query patterns.

Many database vendors now focus on supporting either transaction processing or analytics workloads, but not both.

Stars and Snowflakes: Schemas for Analytics

A wide range of different data models is used in the realm of transaction processing, depending on the needs of the application.

On the other hand, in analytics, there is much less diversity of data models.

Many data warehouses are used in a fairly formulaic style known as a star schema (also known as dimensional modeling).

The name “star schema” comes from the fact that when the table relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; the connections to these tables are like the rays of a star.

In a typical data warehouse, tables are often very wide: fact tables often have over 100 columns, sometimes several hundred.

Column-Oriented Storage

If you have trillions of rows and petabytes of data in your fact tables, storing and querying them efficiently becomes a challenging problem.

Dimension tables are usually much smaller (millions of rows), so in this section we will concentrate primarily on storage of facts.

The idea behind column-oriented storage is simple: don’t store all the values from one row together, but store all the values from each column together instead.

If each column is stored in a separate file, a query only needs to read and parse those columns that are used in that query, which can save a lot of work.

The column-oriented storage layout relies on each column file containing the rows in the same order.

Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the individual column files and put them together to form the 23rd row of the table.
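
A small Python sketch of that layout, with lists standing in for column files and hypothetical column names borrowed from a sales fact table:

```python
# Row-oriented data...
rows = [
    {"date_key": 140102, "product_sk": 69, "quantity": 1},
    {"date_key": 140102, "product_sk": 74, "quantity": 3},
    {"date_key": 140103, "product_sk": 31, "quantity": 5},
]

# ...stored column by column: one "file" (here, a list) per column,
# with all column files keeping rows in the same order.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query touching only `quantity` reads just that one column.
print(sum(columns["quantity"]))          # 9

# Reassembling row k means taking the kth entry from every column file.
k = 2
print({name: values[k] for name, values in columns.items()})
# {'date_key': 140103, 'product_sk': 31, 'quantity': 5}
```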

Column Compression

Besides only loading those columns from disk that are required for a query, we can further reduce the demands on disk throughput by compressing data. Fortunately, column-oriented storage often lends itself very well to compression.

One technique that is particularly effective in data warehouses is bitmap encoding.
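
A minimal sketch of bitmap encoding in Python: one bitmap per distinct value in the column, which also shows how a condition like `WHERE product_sk IN (...)` becomes a cheap bitwise operation (the column name and values are illustrative):

```python
def bitmap_encode(column):
    """One bitmap per distinct value: bit i is 1 iff row i holds that value."""
    return {
        value: [1 if v == value else 0 for v in column]
        for value in sorted(set(column))
    }

product_sk = [69, 69, 69, 74, 31, 31, 69]
bitmaps = bitmap_encode(product_sk)
print(bitmaps[69])   # [1, 1, 1, 0, 0, 0, 1]

# WHERE product_sk IN (31, 74): OR the two bitmaps together.
rows = [a | b for a, b in zip(bitmaps[31], bitmaps[74])]
print(rows)          # [0, 0, 0, 1, 1, 1, 0]
```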

Besides reducing the volume of data that needs to be loaded from disk, column-oriented storage layouts are also good for making efficient use of CPU cycles.

Sort Order in Column Storage

In a column store, it doesn’t necessarily matter in which order the rows are stored.

It’s easiest to store them in the order in which they were inserted, since then inserting a new row just means appending to each of the column files.

Note that it wouldn’t make sense to sort each column independently: we can only reconstruct a row because we know that the kth item in one column belongs to the same row as the kth item in another column.

Another advantage of sorted order is that it can help with compression of columns.
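
A small sketch of why: a sorted column contains long runs of repeated values, which a run-length encoding collapses (the helper below is just an illustration of the idea):

```python
def run_length_encode(column):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [tuple(r) for r in runs]

date_key = [140102, 140103, 140102, 140103, 140102, 140103]
print(run_length_encode(date_key))          # six runs of length 1
print(run_length_encode(sorted(date_key)))  # [(140102, 3), (140103, 3)]
```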

Writing to Column-Oriented Storage

These optimizations make sense in data warehouses, because most of the load consists of large read-only queries run by analysts.

Column-oriented storage, compression, and sorting all help to make those read queries faster. However, they have the downside of making writes more difficult.

An update-in-place approach, like B-trees use, is not possible with compressed columns.

All writes first go to an in-memory store, where they are added to a sorted structure and prepared for writing to disk.

It doesn’t matter whether the in-memory store is row-oriented or column-oriented.

When enough writes have accumulated, they are merged with the column files on disk and written to new files in bulk.

Aggregation: Data Cubes and Materialized Views

Another aspect of data warehouses that is worth mentioning briefly is materialized aggregates.

Data warehouse queries often involve an aggregate function, such as COUNT, SUM, AVG, MIN, or MAX in SQL. If the same aggregates are used by many different queries, it can be wasteful to crunch through the raw data every time.

Why not cache some of the counts or sums that queries use most often?

One way of creating such a cache is a materialized view. In a relational data model, it is often defined like a standard (virtual) view: a table-like object whose contents are the results of some query.

When the underlying data changes, a materialized view needs to be updated, because it is a denormalized copy of the data.

Such updates make writes more expensive, which is why materialized views are not often used in OLTP databases; in read-heavy data warehouses they can make more sense.
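
A minimal Python sketch of a materialized aggregate, here a tiny two-dimensional data cube of summed net prices grouped by date and product (the column names and data are made up for illustration):

```python
from collections import defaultdict

# Raw fact rows (hypothetical columns).
facts = [
    {"date": "2024-01-01", "product": "apple", "net_price": 3.0},
    {"date": "2024-01-01", "product": "pear",  "net_price": 2.0},
    {"date": "2024-01-02", "product": "apple", "net_price": 5.0},
]

# Materialize SUM(net_price) grouped by (date, product): a tiny 2-D data cube.
cube = defaultdict(float)
for row in facts:
    cube[(row["date"], row["product"])] += row["net_price"]

# Repeated queries hit the precomputed cell instead of re-scanning the facts.
print(cube[("2024-01-01", "apple")])                         # 3.0

# Summing along one dimension answers coarser queries, e.g. total per product.
print(sum(v for (d, p), v in cube.items() if p == "apple"))  # 8.0
```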

Summary

On a high level, we saw that storage engines fall into two broad categories:

those optimized for transaction processing (OLTP), and those optimized for analytics (OLAP).

There are big differences between the access patterns in those use cases:

  • OLTP systems are typically user-facing, which means that they may see a huge volume of requests. In order to handle the load, applications usually only touch a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here.

  • Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column-oriented storage is an increasingly popular solution for this kind of workload.

On the OLTP side, we saw storage engines from two main schools of thought:

  • The log-structured school, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group.

  • The update-in-place school, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees are the biggest example of this philosophy, being used in all major relational databases and also many nonrelational ones.

Analytic workloads are very different from OLTP: when your queries require sequentially scanning across a large number of rows, indexes are much less relevant.

Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk.