How does a relational database work

Introduction

While doing my High-Performance Java Persistence training, I came to realize that it’s worth explaining how a relational database works, as otherwise, it is very difficult to grasp many transaction-related concepts like atomicity, durability, and checkpoints.

In this post, I’m going to give a high-level explanation of how a relational database works internally while also hinting at some database-specific implementation details.

A picture is worth a thousand words

[Diagram: the main data structures used by a relational database engine]

Data pages

Disk access is slow. Memory, on the other hand, is orders of magnitude faster, even when compared to Solid-State Drives. For this reason, database vendors try to delay disk access as much as possible. Whether we are talking about tables or indexes, data is divided into pages of a certain size (e.g. 8 KB).

When it needs to read data (tables or indexes), a relational database maps the disk-based pages into memory buffers. When it needs to modify data, the relational database changes the in-memory pages. To synchronize the in-memory pages with the disk, a flush must occur (e.g. via fsync).
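To make this more concrete, here is a minimal Java sketch (not taken from any actual database engine) that reads an 8 KB page into an in-memory buffer, modifies it, and synchronizes it back to disk with an fsync-style call, FileChannel.force. The file name and page layout are made up for illustration:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class PageIO {

    static final int PAGE_SIZE = 8 * 1024; // e.g. 8 KB pages

    public static void main(String[] args) throws IOException {
        try (FileChannel channel = FileChannel.open(
                Path.of("table.dat"),
                StandardOpenOption.CREATE,
                StandardOpenOption.READ,
                StandardOpenOption.WRITE)) {

            long pageNumber = 0;

            // Map the disk-based page into an in-memory buffer
            ByteBuffer page = ByteBuffer.allocate(PAGE_SIZE);
            channel.read(page, pageNumber * PAGE_SIZE);

            // Modify the in-memory page
            page.put(0, (byte) 1);

            // Write the page back and force the OS to flush it to disk (fsync)
            page.rewind();
            channel.write(page, pageNumber * PAGE_SIZE);
            channel.force(true);
        }
    }
}
```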

The buffer pool where disk-based pages are cached is limited in size, hence it usually stores just the data working set. Only if the entire data set can fit into memory can the buffer pool store everything.

However, if the overall data on disk is larger than the buffer pool, then, when a new page needs to be cached, the buffer pool has to evict an old page to make room for the new one.
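One common eviction policy is LRU (Least Recently Used). As a rough sketch only, a toy LRU buffer pool can be built on Java’s LinkedHashMap in access order; real engines use more sophisticated policies (e.g. clock sweep in PostgreSQL), and the flushIfDirty hook below is a hypothetical placeholder:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// A toy buffer pool holding at most `capacity` pages, evicting the
// least recently used page when a new one needs to be cached.
public class BufferPool extends LinkedHashMap<Long, byte[]> {

    private final int capacity;

    public BufferPool(int capacity) {
        super(16, 0.75f, true); // accessOrder = true -> LRU iteration order
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        if (size() > capacity) {
            // A real engine must first write the page back if it is dirty
            flushIfDirty(eldest.getKey(), eldest.getValue());
            return true;
        }
        return false;
    }

    private void flushIfDirty(long pageId, byte[] page) {
        // Write the dirty page back to disk before discarding it (omitted)
    }
}
```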

Undo log

Because the in-memory changes can be accessed by multiple concurrent transactions, a Concurrency Control mechanism (e.g. 2PL or MVCC) must be employed to ensure data integrity. Therefore, once a transaction has modified a table row, the uncommitted changes are applied to the in-memory structures while the previous data is temporarily stored in an append-only structure called the undo log.

While this structure is called the *undo log* in Oracle and MySQL, in SQL Server, the transaction log plays this role. PostgreSQL does not have an undo log, but the same goal is achieved with a multi-version table structure, since tables can store multiple versions of the same row. In all cases, these data structures are used to provide rollback capability, which is a mandatory requirement for Atomicity.

If the currently running transaction rolls back, the undo log will be used to reconstruct the in-memory pages as of the start of the transaction.
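Conceptually, you can picture the undo log as a stack of before-images that gets replayed in reverse on rollback. The following toy Java example illustrates just that idea; real undo records are page-level or row-level structures, not strings:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// A toy transaction that records the previous value of every row it
// changes, so the changes can be undone on rollback.
public class UndoLogDemo {

    record UndoRecord(String rowId, String previousValue) {}

    static final Map<String, String> table = new HashMap<>(Map.of("row-1", "A"));
    static final Deque<UndoRecord> undoLog = new ArrayDeque<>();

    static void update(String rowId, String newValue) {
        // Save the before-image, then apply the uncommitted change
        undoLog.push(new UndoRecord(rowId, table.get(rowId)));
        table.put(rowId, newValue);
    }

    static void rollback() {
        // Apply before-images in reverse order of modification
        while (!undoLog.isEmpty()) {
            UndoRecord undo = undoLog.pop();
            table.put(undo.rowId(), undo.previousValue());
        }
    }

    public static void main(String[] args) {
        update("row-1", "B");
        rollback();
        assert "A".equals(table.get("row-1"));
    }
}
```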

Redo log

Once a transaction commits, the in-memory changes must be persisted. However, that does not mean that every transaction commit triggers an fsync. In fact, that would be very detrimental to application performance. From the ACID transaction properties, we know that a committed transaction must provide Durability, meaning that committed changes need to be persisted even if the database engine is unplugged.

So, how does a relational database provide Durability without issuing an fsync on every transaction commit?

That’s where the redo log comes into play. The redo log is also an append-only disk-based structure that stores every change a given transaction has undergone. So, when a transaction commits, every data page change is written to the redo log as well. Compared to flushing a variable number of data pages scattered all over the disk, writing to the redo log is very fast, since sequential disk access is way faster than random access. Hence, it allows transactions to commit quickly.

While this structure is called *redo log* in Oracle and MySQL, in SQL Server, the transaction log plays this role as well. PostgreSQL calls it Write-Ahead Log (WAL) instead.
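The key point is the write-ahead discipline: commit appends to the sequential log and forces only the log to disk, while the randomly scattered data pages are flushed later. Here is a minimal Java sketch of that idea, with an invented text-based record format:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class RedoLog {

    private final FileChannel logChannel;

    public RedoLog(Path logFile) throws IOException {
        this.logChannel = FileChannel.open(logFile,
            StandardOpenOption.CREATE,
            StandardOpenOption.WRITE,
            StandardOpenOption.APPEND); // append-only, sequential writes
    }

    // Called for every change a transaction makes
    public void append(String redoRecord) throws IOException {
        logChannel.write(ByteBuffer.wrap(
            (redoRecord + "\n").getBytes(StandardCharsets.UTF_8)));
    }

    // Called once at commit: a single sequential fsync makes the
    // transaction durable, without flushing any data page.
    public void commit() throws IOException {
        logChannel.force(false);
    }
}
```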

But then, when are the in-memory pages flushed to disk?

A relational database system uses checkpoints to synchronize the in-memory dirty pages with their disk-based counterparts. To avoid congesting the I/O traffic, the synchronization is usually done in chunks over a longer period of time.
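A checkpoint might therefore look like the following sketch, which flushes dirty pages in small chunks with pauses in between, so that regular transaction I/O is not starved. The chunk size and pause are made-up tuning knobs, not values from any particular database:

```java
import java.util.List;

public class Checkpointer {

    static final int CHUNK_SIZE = 100;    // hypothetical tuning knob
    static final long PAUSE_MILLIS = 50;  // hypothetical tuning knob

    // Flush the dirty in-memory pages in small chunks, pausing between
    // chunks to avoid saturating the I/O bandwidth.
    static void checkpoint(List<Long> dirtyPageIds) throws InterruptedException {
        for (int i = 0; i < dirtyPageIds.size(); i += CHUNK_SIZE) {
            List<Long> chunk = dirtyPageIds.subList(
                i, Math.min(i + CHUNK_SIZE, dirtyPageIds.size()));
            chunk.forEach(Checkpointer::flushPage);
            Thread.sleep(PAUSE_MILLIS);
        }
        // Once every page is flushed, the checkpoint position is recorded
        // so that crash recovery can start from here.
    }

    static void flushPage(long pageId) {
        // Write the in-memory page back to its disk location (omitted)
    }
}
```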

But, what happens if the relational database crashes before flushing all the dirty in-memory pages to the disk?

In case of a crash, upon startup, the database will use the redo log to reconstruct the disk-based data pages that were not synchronized since the last successful checkpoint.
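In other words, crash recovery boils down to scanning the redo log from the last checkpoint and reapplying every logged change that may not have reached the data files. Here is a minimal sketch, with an invented redo record shape based on log sequence numbers (LSNs):

```java
import java.util.List;

public class CrashRecovery {

    // A redo record: which page changed, its new content, and a
    // monotonically increasing log sequence number (LSN).
    record RedoRecord(long lsn, long pageId, byte[] afterImage) {}

    // On startup, replay every change logged after the last checkpoint.
    static void recover(List<RedoRecord> redoLog, long lastCheckpointLsn) {
        redoLog.stream()
            .filter(r -> r.lsn() > lastCheckpointLsn)
            .forEach(r -> applyToPage(r.pageId(), r.afterImage()));
    }

    static void applyToPage(long pageId, byte[] afterImage) {
        // Rewrite the disk-based page with the logged after-image (omitted)
    }
}
```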

Conclusion

These design considerations were adopted to overcome the high latency of disk-based storage while still providing persistent storage guarantees. The undo log is needed to provide Atomicity (rollback capability), while the redo log is needed to ensure the Durability of the disk-based pages (tables and indexes).
