Seize the deal!
Caching Best Practices
Imagine having a tool that can automatically detect if you are using JPA and Hibernate properly. Hypersistence Optimizer is that tool!
In this article, we are going to see how we can limit the SQL query result set to the Top-N rows only.
Limiting the SQL result set is very important when the underlying query could end up fetching a very large number of records, which can have a significant impact on application performance.
Fetching more data than necessary is the number one cause of data access performance issues. When a given business use case is developed, the amount of data available in both the development and QA environment is rather small, hence not all SQL queries are written so that the result set is limited to a fixed number of records.
After the application is deployed into production, data starts accumulating and queries that were once blazing fast start to run slower and slower. Even if indexes are applied to the SQL query filtering and sorting criteria unless the index covers the entire query (e.g., index-only scan), the table records will have to be inspected using a random-access read pattern.
If the result set size is small and the database could use an index for the filtering and sorting criteria, then the cost associated with reading the table records is still smaller than scanning the entire table. On the other hand, if the result set size is very large and the database needs to access a very large percentage of a given table, then using an index would be less efficient than scanning the entire table.
To prove it, consider we have the following
post table in our database which contains 5000 records:
So, if we don’t restrict the result set to the Top-N records:
EXPLAIN ANALYZE SELECT title FROM post ORDER BY id DESC
The execution plan for the SQL query above looks as follows:
| QUERY PLAN | |------------------------------------------------------| | Sort | | (cost=63.66..64.92 rows=504 width=524) | | (actual time=4.999..5.808 rows=5000 loops=1) | | Sort Key: id DESC | | Sort Method: quicksort Memory: 583kB | | -> Seq Scan on post | | (cost=0.00..41.04 rows=504 width=524) | | (actual time=0.059..1.753 rows=5000 loops=1) | | | | Planning time: 0.833 ms | | Execution time: 6.660 ms |
Notice the sequential scan done on all 5000 rows of the
Now, when adding the LIMIT clause which restricts the result set to 5 records only:
EXPLAIN ANALYZE SELECT title FROM post ORDER BY id DESC LIMIT 5
The execution plan for the Top-N SQL query looks like this:
| QUERY PLAN | |-------------------------------------------------| | Limit | | (cost=0.28..0.46 rows=5 width=24) | | (actual time=0.019..0.021 rows=5 loops=1) | | -> Index Scan Backward using post_pkey on post | | (cost=0.28..178.28 rows=5000 width=24) | | (actual time=0.017..0.019 rows=5 loops=1) | | | | Planning time: 0.854 ms | | Execution time: 0.046 ms |
Notice that an index scan was used this time, and only 5 records were scanned and fetched as well. More, the execution time is hundreds of times lower than the previous execution.
The SQL query result set size can impact the execution plan as the database might choose a full-table scan even if an index is available for the query filtering and sorting criteria.
Not only the execution plan can be less efficient, but fetching more data than necessary will consume a significant amount of resources both on the database, server, and client-side too.
First, the records need to be fetched into the database buffer pool.
Afterward, the records are sent over the network to the server. On the server, the JDBC Driver will allocate all the necessary Java Objects to represent the query result set.
However, since the JDBC
ResultSet is not passed to the client, the records must be transformed into entities or DTOs.
The resulting entities or DTOs might be transformed into JSON and passed again over the network to the client, where the JSON objects need to be loaded into the browser memory prior to being used to render the UI.
Fetching large volumes of data requires a significant amount of resources on multiple layers (e.g., database, server, client).
So, since the UI display has a limited size, it makes no sense to fetch more data than can be displayed at once. Now, depending on the underlying relational database system you are using, the SQL clause that allows you to limit the query result set size, might differ.
Up until SQL:2008, there was no standard way of fetching the Top-N records from a given result set. The standard syntax looks as follows:
SELECT title FROM post ORDER BY id DESC FETCH FIRST 5 ROWS ONLY
FETCH FIRST 5 ROWS ONLY clause which tells the database that we are interested in fetching just the first 5 records only. Another thing to notice is that we are using an
ORDER BY clause since, otherwise, there is no guarantee which are the first records to be included in the returning result set.
The SQL:2008 Top-N records clause is supported in Oracle since 12c, SQL Server since 2012, and PostgreSQL since 8.4.
While SQL Server supports the SQL:2008 Top-N standard syntax, you need to provide the OFFSET clause as well:
SELECT title FROM post ORDER BY id DESC OFFSET 0 ROWS FETCH FIRST 5 ROWS ONLY
Since we are interested in the Top-N records only, the OFFSET is 0 in our case.
Prior to SQL Server 2012, you had to use TOP to limit the result set size:
SELECT TOP 5 title FROM post ORDER BY id DESC
Prior to version 12c, to fetch the Top-N records, you had to use a derived table and the ROWNUM pseudocolumn:
SELECT * FROM ( SELECT title FROM post ORDER BY id DESC ) WHERE ROWNUM <= 5
The reason why the outer derived table is used is that the ROWNUM pseudocolumn value is assigned prior to the ORDER BY clause execution. Using a derived table, we can make sure that the ROWNUM pseudocolumn, we use for filtering the Top-N records, is assigned after sorting the underlying result set.
Traditionally, MySQL and PostgreSQL use the LIMIT clause to restrict the result set to the Top-N records:
SELECT title FROM post ORDER BY id DESC LIMIT 5
If you enjoyed this article, I bet you are going to love my upcoming Online Workshops!
- Caching Best Practices with JPA and Hibernate (2.5 hours) on the 30th of September
- High-Performance SQL (4 hours) on the 6th of October in collaboration with Voxxed Days Ticino
- High-Performance SQL (12 hours) starting on the 28th of October in collaboration with Bouvet
Fetching the right amount of data is very important for application performance. Luckily, SQL allows us to limit a given query to the Top-N records using either the SQL:2008 standard syntax or database-specific alternatives.
Hypersistence Optimizer 2.2 has been released!