Why you should use compact table columns

Last modified:

Are you struggling with performance issues in your Spring, Jakarta EE, or Java EE application?

What if there were a tool that could automatically detect what caused performance issues in your JPA and Hibernate data access layer?

Wouldn’t it be awesome to have such a tool to watch your application and prevent performance issues during development, long before they affect production systems?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, Micronaut, or Play Framework.

So, rather than fixing performance issues in your production system on a Saturday night, you are better off using Hypersistence Optimizer to help you prevent those issues so that you can spend your time on the things that you love!

Introduction

In this article, I’m going to explain you should use compact table columns when designing your database schema.

By using compact table columns, you can cache more table records and index entries and, therefore, speed up your SQL queries.

Database caching

As I explained in this article, relational database systems cache the pages that are loaded from the disk in the Buffer Pool (e.g., shared buffers in PostgreSQL) so that the next time the same page is requested by an SQL query execution, the page will be served from the memory (logical read) instead of from the disk (physical read).

Relational Database Architecture

Even if an SSD (Solid State Drive) is very fast compared to the traditional HDD (Hard Disk Drive), RAM is still orders of magnitude faster than SSD, so it makes sense to try to store the working set of data in memory instead of having to load it from the disk on every SQL query execution.

StackOverflow database schema

Some years ago, StackOverflow published a performance metrics page, and the most interesting aspect is that the web application was using just a single active SQL Server node and one hot standby node to serve the entire planet.

If you’re wondering how StackOverflow could serve 11k query per second with just one active node, then you should know that they use 1.5 TB of RAM while, according to the performance page, the entire StockOverflow DB is 2.8 TB.

The beauty of StackOverflow is that its database schema is available online, so you can take a look and see what choices the StackOverflow developers had made in order to make the web application run fast with just a single SQL Server node.

Here are several findings:

There are no UUID Primary Key and Foreign Key columns that waste 16 bytes to store a single value.
Most Primary Key columns are either int (4 bytes) or tinyint (1 byte) instead of bigint (8 bytes).
Enum values are represented with just 1 byte (e.g., tinyint column).
Many of the VARCHAR columns have a maximum length to avoid storing too much data.

While UUID Primary Ley columns are fine for distributed SQL databases that no longer use a Buffer Pool, like YugabyteDB, for relational database systems, UUID Primary Key columns are, most often, a terrible choice.

For more details about this topic, check out this article.

Compact table columns for MySQL

Let’s assume we have a table that can never hold more than 256 records. If we are using MySQL or SQL Server, then we can choose the tinyint column type for the Primary Key column.

For instance, if we have a country table that stores all the countries where our customers are located, knowing that there are less than 200 countries in the world, it makes no sense to use the int or bigint types for the primary column value.

Not only that we will save space for every country table record, but every foreign key column value will also require less space, as illustrated by the following table diagram:

Compact table columns with MySQL

While this optimization works fine on MySQL, it may not be the same for all relational database systems.

For example, PostgreSQL may use byte padding when aligning columns on disk, as explained in this article.

Testing on MySQL

In the following example, we are going to create 200 country records and 5,000,000 customer table rows.

Beside the default index created on the Primary Key column, we have created an index for the country_id Foreign Key column as well:

CREATE INDEX idx_customer_country_id ON customer (country_id)

When using the tinyint unsigned column type for the country identifier, the customer table has an overall size of 236.27 MB:

select
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024), 2)
from information_schema.TABLES
where TABLE_SCHEMA = 'high_performance_java_persistence' AND TABLE_NAME = 'customer'

236.27 MB

The Foreign Key index on the customer table takes 63.58 MB:

select
    ROUND((INDEX_LENGTH / 1024 / 1024), 2)
from information_schema.TABLES
where TABLE_SCHEMA = 'high_performance_java_persistence' AND TABLE_NAME = 'customer'

63.58 MB

When using the smallint unsigned column type for the country identifier, the customer table has an overall size of 307.86 MB:

select
    ROUND(((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024), 2)
from information_schema.TABLES
where TABLE_SCHEMA = 'high_performance_java_persistence' AND TABLE_NAME = 'customer'

307.86 MB

The Foreign Key indexes of the customer table take 130.17 MB:

select
    ROUND((INDEX_LENGTH / 1024 / 1024), 2)
from information_schema.TABLES
where TABLE_SCHEMA = 'high_performance_java_persistence' AND TABLE_NAME = 'customer'

130.17 MB

So, for a single Foreign Key column that has an associated index, there is a 71.59 MB overhead for using the smallint column type as opposed to the tinyint when the table has five million records. If we have ten such Foreign Key columns that are indexed and the tables have 100 million records, the overhead will be 14.3 GB.

Compact table columns for PostgreSQL

When using PostgreSQL, we need to be aware about that the table row layout follows a specific data alignment and smaller column types may be padded. To ensure the most compact byte alignment, we have to employ the Column Tetris appraoch suggested by Erwin Brandstetter.

Since PostgreSQL does not provide a tinyint column type, we can use the smallint type for the country Primary Key and the country_id Foreign Key column in the customer table.

However, according to the Column Tetris logic, we need to place the customer_id column at the end of the table like this:

create table customer (
    id integer not null,
    first_name varchar(100),
    last_name varchar(100),
    country_id smallint,
    primary key (id)
)

Testing on PostgreSQL

If we create 5,000,000 customer records, for the smallint Foreign Key column, the customer table will take 211 MB:

select pg_size_pretty(pg_table_size('customer'))

211 MB

And the idx_customer_country_id Foreign Key column index takes also 33 MB:

select pg_size_pretty(pg_table_size('idx_customer_country_id')

33 MB

Now, if we switch from smallint to int so that both the Primary Key column in the country table and the Foreign Key column in the customer table take 4 bytes instead of two, the customer table will take 249 MB:

select pg_size_pretty(pg_table_size('customer'))

249 MB

And the idx_customer_country_id Foreign Key column index takes also 33 MB:

select pg_size_pretty(pg_table_size('idx_customer_country_id')

33 MB

So, for five million records, using the more compact smallint column allows us to save 38 MB on the customer table.

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

The more compact the table columns, the more table rows and index entries will fit into an in-memory page. The more records are stored in the Buffer Pool or the OS cache, the higher the probability of loading data from the memory instead of the disk when executing a query.

This is not just theoretical advice. StackOverflow has been successfully employing this strategy for over 15 years, and it allowed them to serve the entire planet with just one active SQL Server node.

Follow @vlad_mihalcea