In this article, we are going to explore the YugabyteDB architecture and see how it manages to provide automatic sharding and failover without compromising data integrity.
YugabyteDB is a distributed SQL database, so its architecture is different from the one employed by traditional relational database systems.
Traditional relational database architecture
Most relational database systems use a Single-Primary replication architecture, as illustrated by the following diagram:
There is a single Primary node that handles read-write transactions, while several Replica nodes can serve read-only transactions.
Data changes happen only on the Primary node and are propagated to the Replica nodes by applying the transactions from the Primary node's Redo Log. If the Primary node breaks, then one of the existing Replica nodes becomes the new Primary.
As I explained in this article, the internal relational database system architecture looks as follows:
The Buffer Pool is used to mirror table and index pages from the disk into the database memory to avoid costly IO disk operations.
When a table record is modified, the change happens directly on the page stored in the Buffer Pool, and the Undo Log stores the changes needed to recover the previous state of the record in case the current transaction needs to roll back. So, the Undo Log is the structure that provides Atomicity.
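To illustrate the idea, here's a minimal sketch (not actual database code) of how an Undo Log enables rollback: before each in-place change, a compensating action that restores the previous value is recorded, and rolling back replays those actions newest-first. The class and method names are illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Toy Undo Log: before each in-place change, the previous value is
// recorded so the transaction can be rolled back to its starting state.
public class UndoLogSketch {
    private final Map<String, String> page = new HashMap<>(); // in-memory page
    private final Deque<Runnable> undoLog = new ArrayDeque<>();

    public void update(String key, String newValue) {
        String oldValue = page.get(key);
        // record the compensating action before applying the change
        undoLog.push(() -> {
            if (oldValue == null) page.remove(key);
            else page.put(key, oldValue);
        });
        page.put(key, newValue);
    }

    public void rollback() {
        while (!undoLog.isEmpty()) {
            undoLog.pop().run(); // restore previous state, newest first
        }
    }

    public String get(String key) { return page.get(key); }
}
```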
If the transaction is committed, the database will write the record modification into the Redo Log, which can be used to replicate the changes from the Primary node to the Replica nodes.
The Buffer Pool is not synchronized at commit time. Instead, the in-memory pages are synchronized continuously so that the I/O workload is distributed over time without incurring I/O spikes that could affect the OLTP transactions running on the system.
If the database crashes, the unsynchronized modifications that were stored in the Buffer Pool will be lost. However, upon restarting the database server, the database will recover that state from the Redo Log by replaying the transactions that didn’t get the chance to be synchronized with the disk. Therefore, the Redo Log is the structure that provides Durability.
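The Redo-Log-based recovery mechanism can be sketched in a few lines (again, toy code, not an actual database implementation): committed changes are appended to a durable log, the in-memory state can be lost at any time, and replaying the log rebuilds it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy Redo-Log crash recovery: committed changes are appended to a durable
// log, and replaying that log after a crash rebuilds the lost in-memory
// Buffer Pool state.
public class RedoLogSketch {
    static class RedoEntry {
        final String key, value;
        RedoEntry(String key, String value) { this.key = key; this.value = value; }
    }

    private final List<RedoEntry> redoLog = new ArrayList<>(); // durable, append-only
    private Map<String, String> bufferPool = new HashMap<>();  // volatile memory

    public void commit(String key, String value) {
        redoLog.add(new RedoEntry(key, value)); // written at commit time
        bufferPool.put(key, value);             // flushed to disk only later
    }

    public void crash() {
        bufferPool = new HashMap<>(); // unsynchronized pages are lost
    }

    public void recover() {
        for (RedoEntry e : redoLog) {
            bufferPool.put(e.key, e.value); // replay committed transactions
        }
    }

    public String get(String key) { return bufferPool.get(key); }
}
```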
There are several challenges imposed by this traditional relational database architecture.
First, the Buffer Pool has a limited size. While Oracle, SQL Server, and MySQL allow you to allocate a large portion of the available RAM to the Buffer Pool, PostgreSQL uses a rather small `shared_buffers` setting, as it relies on the Operating System cache to speed up the loading of the pages that had to be evicted from the Buffer Pool.
Since usually a single Primary node is being used, scaling requires adding more RAM to the Primary node to accommodate more and more pages needed by the working data set. And RAM is not the only resource in limited supply: the CPU can also be saturated, or the system can run out of disk space. So, to handle more load, you'd need to migrate the database nodes to servers that have more resources (e.g., RAM, CPU cores, disk space).
Increasing throughput by adding more resources to a given database instance is called vertical scaling. However, scaling vertically can be very costly, especially when using cloud computing.
So, when it’s no longer feasible to increase the Primary node hardware capacity, sharding has to be employed. But this is easier said than done since the traditional relational database SQL query planners are not designed to work efficiently when data is scattered among multiple shards.
YugabyteDB top-level architecture
As a distributed SQL database, YugabyteDB has a very different architecture compared to traditional relational database systems.
YugabyteDB uses auto-sharding by default, so data is distributed automatically across multiple nodes. However, unlike with Single-Primary replication, all nodes can be used for both reading and writing data.
In order to allow all database nodes to accept writes, one node becomes the Leader (Primary) of a given table record while the other nodes become Followers (Replicas) of that table row.
To accommodate this design, YugabyteDB needs a service that can locate the node storing the Leader shard for the record we want to change, and that's why YugabyteDB provides the YB-Master service.
The YB-Master service stores the table metadata, as well as the mapping between each table row and the nodes that store the Leader Tablet and the Follower Tablets holding that particular row.
Knowing where the Leader and Followers are stored, the YB-Master service provides load balancing, as well as auto-sharding.
For fault tolerance, the YB-Master service runs on three nodes: one node hosts the YB-Master Leader, while the other two host the YB-Master Followers. The Leader election is done using the Raft consensus protocol.
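The core rule behind Raft leader election is simple: a candidate becomes Leader only after gathering votes from a strict majority of the group. A minimal illustration of that majority rule (the full Raft protocol involves terms, log matching, and heartbeats, which this sketch omits):

```java
// Toy illustration of the majority rule behind Raft leader election:
// with a three-node YB-Master group, a candidate needs 2 out of 3 votes.
public class MajorityVoteSketch {
    static boolean winsElection(int votesReceived, int clusterSize) {
        return votesReceived > clusterSize / 2; // strict majority required
    }
}
```

This is also why the YB-Master group has an odd number of nodes: a three-node group tolerates the loss of one node while still being able to form a majority.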
If a cluster has more than three nodes, the other database nodes that don’t host the YB-Master service will just cache the database metadata locally.
Unlike the YB-Master service, the YB-TServer service runs on every database node, and it is responsible for executing queries and statements and storing the underlying data.
The YB-TServer has two main components:
- the YQL query engine
- the DocDB storage engine
YQL query engine
The YQL query engine provides two APIs:
- YSQL – an SQL query engine based on PostgreSQL
- YCQL – a semi-relational API based on Cassandra Query Language
Because the YSQL engine is based on PostgreSQL, you can connect to YugabyteDB using the PostgreSQL JDBC Driver. However, if you’re using a YugabyteDB cluster that distributes the load on multiple nodes, it’s better to use the YugabyteDB-specific JDBC Driver, which provides automatic load balancing and failover.
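The difference between the two drivers is visible in the connection URLs they use. The `jdbc:yugabytedb://` prefix and the `load-balance=true` parameter below are taken from the YugabyteDB smart driver documentation as I understand it; verify them against your driver version. This sketch only builds the URL strings, so it doesn't require a running cluster:

```java
// Sketch of the two connection options. The URL parameters are assumptions
// based on the YugabyteDB smart driver docs; check your driver version.
public class YugabyteConnectionUrls {
    // Plain PostgreSQL JDBC driver: connects to a single node only.
    static String postgresUrl(String host, String db) {
        return "jdbc:postgresql://" + host + ":5433/" + db;
    }

    // YugabyteDB smart driver: cluster-aware, balances connections
    // across nodes when load balancing is enabled in the URL.
    static String yugabyteUrl(String host, String db) {
        return "jdbc:yugabytedb://" + host + ":5433/" + db + "?load-balance=true";
    }
}
```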
For more details about YugabyteDB fault tolerance, check out this article.
Just like the PostgreSQL query parser, the YugabyteDB YSQL query engine provides the following:
- the Parser, which compiles the SQL query into an Abstract Syntax Tree query model
- the Optimizer, which generates the Execution Plan
- the Executor, which runs the Execution Plan against the data storage engine
DocDB storage engine
Unlike a relational database, YugabyteDB doesn't update rows directly in the Buffer Pool pages and then synchronize the dirty in-memory Heap Table or B+Tree pages with the disk.
Instead, YugabyteDB uses a distributed document store called DocDB, which stores the table records as documents in an LSM (log-structured merge) tree structure. The LSM-Tree is more efficient for writes on SSDs and is easier to replicate since it's an append-only structure.
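To see why an LSM-Tree is write-friendly, here's a toy sketch (not DocDB or RocksDB code): writes go into an in-memory memtable, and when the memtable fills up, it is flushed as an immutable sorted run; reads check the memtable first, then the runs from newest to oldest. The flush threshold and structure names are illustrative; real LSM engines also compact runs in the background.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Toy LSM-tree: writes land in a memtable; full memtables are flushed as
// immutable sorted runs; reads search from newest data to oldest.
public class LsmSketch {
    private static final int MEMTABLE_LIMIT = 2; // tiny, for illustration
    private TreeMap<String, String> memtable = new TreeMap<>();
    private final Deque<Map<String, String>> sortedRuns = new ArrayDeque<>();

    public void put(String key, String value) {
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) {
            sortedRuns.push(memtable); // flush as an immutable sorted run
            memtable = new TreeMap<>();
        }
    }

    public String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (Map<String, String> run : sortedRuns) { // newest run first
            v = run.get(key);
            if (v != null) return v;
        }
        return null;
    }
}
```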
The DocDB document store is based on RocksDB, a very popular embeddable key/value store.
For instance, Facebook uses RocksDB for its MySQL storage engine, which is called MyRocks.
In YugabyteDB, a database table is split into multiple shards, called Tablets. These Tablets are distributed across multiple nodes: one node stores the Leader Tablet, while the other nodes store the Followers for that particular Tablet.
When updating a table record, YugabyteDB locates the node holding the Leader Tablet and first applies the change to the Leader Tablet. Then it replicates the change to one Tablet Replica synchronously and to the other Followers asynchronously.
By using multiple Tablets or shards, reads and writes can be distributed among multiple nodes, which can provide better scalability.
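The routing idea can be sketched as follows: a row key is hashed to a Tablet, and each Tablet has one Leader node plus Follower nodes. The hash function, Tablet count, and node assignment below are illustrative only, not YugabyteDB's actual partitioning scheme:

```java
// Toy hash-based tablet routing: a row key maps to a tablet, and each
// tablet has one Leader node (the rest act as Followers). Illustrative
// only; YugabyteDB's real scheme is managed by the YB-Master service.
public class TabletRouterSketch {
    private final int tabletCount;
    private final String[] nodes;

    public TabletRouterSketch(int tabletCount, String... nodes) {
        this.tabletCount = tabletCount;
        this.nodes = nodes;
    }

    int tabletFor(String rowKey) {
        // deterministic: the same row key always maps to the same tablet
        return Math.floorMod(rowKey.hashCode(), tabletCount);
    }

    // The node hosting the Leader Tablet for this row; writes go here first.
    String leaderFor(String rowKey) {
        return nodes[tabletFor(rowKey) % nodes.length];
    }
}
```

Because different row keys hash to different Tablets, writes for different rows land on different Leader nodes, which is how the write load gets spread across the cluster.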
While using YugabyteDB is just as easy as using PostgreSQL, its architecture is very different due to its distributed nature.
Knowing the YugabyteDB architecture makes it easier to understand why it can offer strong consistency without taking row-level locks, or why it can load balance both reads and writes across multiple nodes.