How does a relational database work

Introduction

While doing my High-Performance Java Persistence training, I came to realize that it’s worth explaining how a relational database works, as otherwise, it is very difficult to grasp many transaction-related concepts like atomicity, durability, and checkpoints.

In this post, I’m going to give a high-level explanation of how a relational database works internally, while also hinting at some database-specific implementation details.

A picture is worth a thousand words

[Diagram: relational database data structures]

Data pages

Disk access is slow. Memory, on the other hand, is orders of magnitude faster, even compared to Solid-State Drives. For this reason, database vendors try to delay disk access as much as possible. Whether we are talking about tables or indexes, data is divided into pages of a certain size (e.g. 8 KB).

When it needs to read data (tables or indexes), a relational database maps the disk-based pages into memory buffers. When it needs to modify data, it changes the in-memory pages. To synchronize the in-memory pages with the disk, a flush must occur (e.g. fsync).

The buffer pool, where the disk-based pages are cached, is limited in size, so it usually holds only the working set of data. Only if the entire data set fits into memory can the buffer pool store everything.

However, if the overall data on disk is larger than the buffer pool, when a new page needs to be cached, the buffer pool has to evict old pages to make room for the new ones.
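To make the eviction policy concrete, here is a minimal Java sketch of a buffer pool that caches fixed-size pages by page id and evicts the least-recently-used page when the pool is full. The class, the 8 KB page size, and the plain LRU policy are illustrative assumptions, not how any particular engine works; real databases use more elaborate replacement policies (PostgreSQL, for example, uses a clock-sweep algorithm).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy buffer pool: caches fixed-size pages by page id and evicts the
// least-recently-used page when the pool is full.
public class BufferPool {

    private final int capacity;
    private final Map<Long, byte[]> pages;

    public BufferPool(int capacity) {
        this.capacity = capacity;
        // accessOrder = true turns the LinkedHashMap into an LRU cache
        this.pages = new LinkedHashMap<>(capacity, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
                return size() > BufferPool.this.capacity;
            }
        };
    }

    public byte[] getPage(long pageId) {
        // On a cache miss, the page is read from disk (simulated here)
        return pages.computeIfAbsent(pageId, this::readPageFromDisk);
    }

    private byte[] readPageFromDisk(long pageId) {
        return new byte[8 * 1024]; // e.g. an 8 KB page
    }
}
```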

Undo log

Because the in-memory changes can be accessed by multiple concurrent transactions, a Concurrency Control mechanism (e.g. 2PL or MVCC) must be employed to ensure data integrity. Therefore, once a transaction has modified a table row, the uncommitted changes are applied to the in-memory structures, while the previous data is temporarily stored in an append-only structure called the undo log.

While this structure is called the *undo log* in Oracle and MySQL, in SQL Server, the transaction log plays this role. PostgreSQL does not have an undo log, but the same goal is achieved with a multi-version table structure, since tables can store multiple versions of the same row. All these data structures are used to provide rollback capability, which is a mandatory requirement for Atomicity.

If the currently running transaction rolls back, the undo log will be used to reconstruct the in-memory pages as of the start of the transaction.
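As a rough illustration of the rollback mechanics, the following Java sketch records a before-image of every row a transaction modifies, so that a rollback can restore the state as of the start of the transaction. The UndoRecord shape and the in-memory table are invented for this example; real undo logs are considerably more sophisticated.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// A toy undo log: before applying an in-memory change, the transaction
// records the previous row value so that a rollback can restore it.
public class UndoLogDemo {

    record UndoRecord(long rowId, String beforeImage) {}

    private final Map<Long, String> table = new HashMap<>();
    private final Deque<UndoRecord> undoLog = new ArrayDeque<>();

    void update(long rowId, String newValue) {
        // Record the before-image first, then apply the in-memory change
        undoLog.push(new UndoRecord(rowId, table.get(rowId)));
        table.put(rowId, newValue);
    }

    void rollback() {
        // Replay the undo records in reverse order to reconstruct the
        // state as of the start of the transaction
        while (!undoLog.isEmpty()) {
            UndoRecord record = undoLog.pop();
            table.put(record.rowId(), record.beforeImage());
        }
    }
}
```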

Redo log

Once a transaction commits, the in-memory changes must be persisted. However, that does not mean that every transaction commit triggers an fsync; in fact, that would be very detrimental to application performance. From the ACID transaction properties, we know that a committed transaction must provide Durability, meaning that committed changes need to survive even a power failure or a database crash.

So, how does a relational database provide Durability without issuing an fsync on every transaction commit?

That’s where the redo log comes into play. The redo log is an append-only disk-based structure that stores every change a given transaction has undergone. So, when a transaction commits, every data page change is written to the redo log as well. Compared to flushing a variable number of data pages, writing to the redo log is very fast since sequential disk access is much faster than random access. Hence, it also allows transactions to commit quickly.

While this structure is called the *redo log* in Oracle and MySQL, in SQL Server, the transaction log plays this role as well. PostgreSQL calls it the Write-Ahead Log (WAL) instead.
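Here is a minimal Java sketch of that commit path, under the stated assumptions: change records are appended sequentially to a log file, and a single fsync (FileChannel.force) makes the commit durable while the dirty data pages stay in memory. The plain-text record format and the class names are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// A toy redo log: change records are appended sequentially and fsync-ed on
// commit; the modified data pages are flushed later, at checkpoint time.
public class RedoLog implements AutoCloseable {

    private final FileChannel channel;

    public RedoLog(Path logFile) throws IOException {
        this.channel = FileChannel.open(logFile,
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    public void commit(Iterable<String> changeRecords) throws IOException {
        for (String record : changeRecords) {
            channel.write(ByteBuffer.wrap(
                (record + System.lineSeparator()).getBytes(StandardCharsets.UTF_8)));
        }
        // One sequential fsync per commit instead of flushing random data pages
        channel.force(true);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
```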

But then, when are the in-memory pages flushed to the disk?

A relational database system uses checkpoints to synchronize the in-memory dirty pages with their disk-based counterparts. To avoid congesting the I/O traffic, the synchronization is usually done in chunks over a longer period of time.

But, what happens if the relational database crashes before flushing all the dirty in-memory pages to the disk?

In case of a crash, upon startup, the database will use the redo log to reconstruct the disk-based data pages that were not synchronized since the last successful checkpoint.
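As a simplified sketch of that recovery step (the RedoRecord shape and the checkpoint LSN bookkeeping are illustrative, not the format of any particular engine), the database replays every redo record written after the last completed checkpoint:

```java
import java.util.List;

// A toy crash-recovery routine: redo records up to the last checkpoint are
// already reflected in the disk pages, so only the later ones are replayed.
public class RecoveryDemo {

    record RedoRecord(long lsn, long pageId, byte[] afterImage) {}

    static void recover(List<RedoRecord> redoLog, long lastCheckpointLsn) {
        for (RedoRecord record : redoLog) {
            if (record.lsn() > lastCheckpointLsn) {
                applyToDiskPage(record.pageId(), record.afterImage());
            }
        }
    }

    static void applyToDiskPage(long pageId, byte[] afterImage) {
        // In a real engine, this would rewrite the on-disk data page
        System.out.printf("restoring page %d (%d bytes)%n", pageId, afterImage.length);
    }
}
```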

If you enjoyed this article, I bet you are going to love my book as well.

Conclusion

These design considerations were adopted to overcome the high latency of disk-based storage while still providing persistent storage guarantees. The undo log is needed to provide Atomicity (rollback capability), while the redo log is needed to ensure the Durability of the disk-based pages (tables and indexes).

If you liked this article, you might want to subscribe to my newsletter too.


14 thoughts on “How does a relational database work”

  1. Thanks for that explanation, it is very helpful.
    But I still have one question on durability. As I understood it, the redo log is an in-memory structure (RAM). The use case: a transaction commits (fsync is not invoked yet) and, right after that, the server loses power. How is durability guaranteed in that case?

    1. The redo log is not an in-memory structure. It’s an append-only log on the disk. The difference between flushing all the pages and writing to a log is that a sequential write is much faster.

  2. Which tables/indexes are cloned in memory, just the ones in use by the current transaction? If yes, what happens when we have more transactions on different tables, will all the tables/indexes be in memory?

    thx

    1. The tables and indexes are not entirely cloned in memory, only the working set is mapped in memory based on how data pages are accessed. The pages are shared by all transactions, and locks are used to coordinate concurrent writes.

      1. And does this mean that if I have a ‘select * from tbl_1 where filter_condition’, and the filter_condition uses a column without an index, will the whole set be in memory with the filter_condition applied to it, or is the filter_condition applied before loading the set into memory?

        thx Vlad 😉

      2. If you don’t have an index on the columns in the WHERE clause predicate, the DB will use a full-table scan. That means it will have to load the table pages in memory because otherwise it would not be able to filter the rows. If you have configured the shared buffers high enough, the whole table could be mapped into memory; otherwise, it will have to evict old pages to make room for new ones while executing the query.

    1. No. The size of the buffer pool affects how many in-memory data page cache entries you can store, while the JDBC fetchSize determines how many database roundtrips are needed to fetch the whole ResultSet.

  3. “The tables and indexes are not entirely cloned in memory, only the working set is mapped in memory based on how data pages are accessed.”

    Also, there is a typo in the article: “database will us ethe redo log”.

    1. It depends on the size of the data and the available memory. If you can fit all your data into memory, then this statement would be correct. Otherwise, the database will have to remove pages when needing to access new ones if the buffer is already full.

  4. “The tables and indexes are not entirely cloned in memory, only the working set is mapped in memory based on how data pages are accessed.”

    Which tables/indexes are cloned in memory, just the ones in use by the current transaction?

    1. “Which tables/indexes are cloned in memory, just the ones in use by the current transaction?”

      Every page that is read goes into the buffer pool. The ones that are dirty cannot be evicted until they are flushed. The ones that are not dirty can be evicted if the pool is full and new pages are read.

      “The tables and indexes are not entirely cloned in memory, only the working set is mapped in memory based on how data pages are accessed.”

      That’s not true. What if I have a 1 GB database and I configure the DB engine to use a 10 GB buffer pool? In this case, the entire data set and indexes can fit into memory, right?
