While doing my High-Performance Java Persistence training, I came to realize that it’s worth explaining how a relational database works, as otherwise, it is very difficult to grasp many transaction-related concepts like atomicity, durability, and checkpoints.
In this post, I’m going to give a high-level explanation of how a relational database works internally while also hinting some database-specific implementation details.
A picture is worth a thousand words
Disk access is slow. On the other hand, the memory is orders of magnitude faster even than Solid-State Drives. For this reason, database vendors try to delay disk access as much as possible. Whether we are talking about tables or indexes, data is divided into pages of a certain size (e.g. 8 KB).
When it needs to read data (tables or indexes), a relational database will map the disk-based pages into memory buffers. When it needs to modify data, the relational database changes the in-memory pages. To synchronize the in-memory pages with the disk, a flushing must occur (e.g. fsync).
The buffer pool where disk-based pages are stored is limited in size, hence it usually needs to store the data working set. Only if the entire data can fit into memory, the buffer pool could store the entire data set.
However, if the overall data on disk is larger than the buffer pool size when a new page needs to be cached, the buffer pool will have to evict an old page to make room for the new ones.
Because the in-memory changes can be accessed by multiple concurrent transactions, a Concurrency Control mechanism (e,g, 2PL or MVCC) must be employed so to ensure data integrity. Therefore, once a transaction has modified a table row, the uncommitted changes are applied to the in-memory structures while the previous data is temporarily stored in an undo log append-only structure.
While this structure is called *undo log* in Oracle and MySQL, in SQL Server, the transaction log plays this role. PostgreSQL does not have an undo log, but the same goal is achieved with a multi-version table structure since tables can store multiple versions of the same row. However, all these data structures are used to provide rolling back capability which is a mandatory requirement for Atomicity.
If the currently running transaction rolls back, the undo log will be used to reconstruct the in-memory pages as of the start of the transaction.
Once a transaction commits, the in-memory changes must become persisted. However, that does not mean that every transaction commit will trigger an fsync. In fact, that would be very detrimental to application performance. However, from the ACID transaction properties, we know that a committed transaction must provide Durability, meaning that committed changes need to be persisted even if we unplug the database engine.
So, how does a relational database provide Durability without issuing a fsync on every transaction commit?
That’s where the redo log comes into play. The redo log is also an append-only disk-based structure that stores every change a given transaction has undergone. So, when a transaction commits, every data page change will be written to the redo log as well. Compared to flushing an invariable number of data pages, writing to the redo log is very fast since sequential disk access is way faster than Random access. Hence, it also allows transactions to be fast.
While this structure is called *redo log* in Oracle and MySQL, in SQL Server, the transaction log plays this role as well. PostgreSQL calls it Write-Ahead Log (WAL) instead.
But then, when are the in-memory changes flushed to the disk?
A relational database system uses checkpoints to synchronize the in-memory dirty pages with their disk-based counterparts. To avoid congesting the IO traffic, the synchronization is usually done in chunks during a larger period of time.
But, what happens if the relational database crashes before flushing all the dirty in-memory pages to the disk?
In case of a crash, upon startup, the database will use the redo log to reconstruct the disk-based data pages that were not synchronized since the last successful checkpoint.
These design considerations were adopted to overcome the high latency of disk-based storages while still providing persistent storage guarantees. So, the undo log is needed to provide Atomicity (rolling back capabilities), while the redo log is needed to ensure disk-based pages (table and indexes) Durability.