What’s new in JPA 2.2 – Stream the result of a Query execution


Introduction

In this article, we are going to see how Hibernate supports the JPA 2.2 Stream query results, as well as the caveats of using database cursors just to limit the amount of data that needs to be fetched.

Reason for streaming

Probably one of the most anticipated JPA 2.2 features, streaming query results was added for the following reasons:

Reading large datasets using JPA is quite uncomfortable these days as all method signatures return Lists, which causes the entire `ResultSet` to be pulled into memory before it can be handed to clients.

Currently, users work around this by paging through the results which sort of works but is error-prone regarding inserts and deletes that might touch the same set of data to be read causing inconsistencies while iterating.

The aforementioned quote was taken from the associated JPA 2.2 GitHub issue.

Fetching too much data can cause serious data access performance issues, which is why queries should select only as much data as needed by a given business use case.

Traditionally, there have been two ways of limiting the amount of data fetched by a ResultSet:

  • using SQL-level pagination (e.g. OFFSET with LIMIT or Keyset Pagination)
  • streaming via a database cursor

While JPA already offers SQL-level pagination via the setFirstResult and setMaxResults Query methods, database cursor streaming was not supported by the JPA 2.1 specification.
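
For instance, a minimal SQL-level pagination sketch using the standard JPA API, assuming the Post entity used throughout this article, looks as follows:

List<Post> posts = entityManager
    .createQuery(
        "select p " +
        "from Post p " +
        "order by p.createdOn desc", Post.class)
    .setFirstResult( 50 ) // OFFSET: skip the first page of 50 entries
    .setMaxResults( 50 )  // LIMIT: fetch at most 50 entries
    .getResultList();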

By default, a JPA Query will always return either a List or a single object. To create the List, the JDBC ResultSet must be navigated, so all the underlying data is extracted in order to populate the returned List (e.g. entities, DTO projections).

Therefore, the argument for streaming is that the ResultSet does not have to be iterated upfront just to construct the returned List.
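
JPA 2.2 standardizes this through the new getResultStream Query method, which Hibernate implements on top of its own stream method (used in the examples below). A minimal sketch, assuming the Post entity exposes a getTitle accessor:

try(Stream<Post> postStream = entityManager
    .createQuery(
        "select p " +
        "from Post p " +
        "order by p.createdOn desc", Post.class)
    .getResultStream()
) {
    // rows are consumed lazily instead of being materialized into a List upfront
    postStream.forEach( post -> LOGGER.info( "Post title: {}", post.getTitle() ) );
}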

Gotchas

Assuming you run the following test case:

List<Post> posts = doInJPA(entityManager -> {
    try(Stream<Post> postStream = entityManager
        .createQuery(
            "select p " +
            "from Post p " +
            "order by p.createdOn desc", Post.class)
        .unwrap(Query.class) // unwrap to the Hibernate Query interface, which exposes stream()
        .stream()
    ) {
        return postStream
        .limit( 50 )
        .collect( 
            Collectors.toList() 
        );
    }
});

The Hibernate stream method creates a JDBC scrollable ResultSet like this:

final ScrollableResultsImplementor scrollableResults = scroll( 
    ScrollMode.FORWARD_ONLY 
);

However, by the time the limit method is called, Hibernate has already generated the following SQL statement:

SELECT 
    p.id as id1_0_, 
    p.title as title2_0_ 
FROM 
    post p
ORDER BY
    p.created_on DESC
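
Therefore, the limit operation is applied by the application, not by the database. For comparison, had we called the setMaxResults Query method instead, the pagination would have made it into the SQL statement which, assuming the PostgreSQL dialect, would look something like this:

SELECT 
    p.id as id1_0_, 
    p.title as title2_0_ 
FROM 
    post p
ORDER BY
    p.created_on DESC
LIMIT 50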

Minding the JDBC Driver fetch size

The JDBC ResultSet acts as an application-level cursor, and it’s up to the JDBC driver to decide how the underlying data set is to be fetched:

  • On Oracle, the default JDBC fetch size is just 10 meaning that a ResultSet of 100 entries will require 10 database roundtrips.
  • On SQL Server, an adaptive buffering technique is employed so that data is not fetched at once, but in batches and only if needed by the client.
  • On PostgreSQL and MySQL, the whole ResultSet is fetched at once even if the client does not navigate the entire ResultSet.

To use a custom fetch size, you need to call the setFetchSize method on the JDBC Statement object.
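
At the JDBC level, a minimal sketch, assuming a java.sql.Connection is already available, might look like this:

try(PreparedStatement statement = connection.prepareStatement(
        "SELECT p.id, p.title FROM post p ORDER BY p.created_on DESC")) {
    // hint the driver to fetch the rows in batches of 50 instead of all at once
    statement.setFetchSize(50);
    try(ResultSet resultSet = statement.executeQuery()) {
        while (resultSet.next()) {
            // process the current row
        }
    }
}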

For MySQL, in order to use a cursor, you have two options:

  • either set the Statement fetch size to Integer.MIN_VALUE, in which case the driver streams the rows one by one
  • or set the useCursorFetch Connection property to true and provide a positive Statement fetch size, in which case the driver uses a server-side cursor
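
As a minimal sketch, the streaming option can be requested through the same JPA fetch size hint shown in the PostgreSQL example below (the useCursorFetch alternative is enabled via the JDBC URL instead, e.g. jdbc:mysql://localhost/db?useCursorFetch=true, given here only as an illustration):

try(Stream<Post> postStream = entityManager
    .createQuery(
        "select p " +
        "from Post p " +
        "order by p.createdOn desc", Post.class)
    .setHint( QueryHints.HINT_FETCH_SIZE, Integer.MIN_VALUE ) // MySQL streaming mode
    .unwrap(Query.class)
    .stream()
) {
    // consume the Stream while the underlying connection is still open
}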

For PostgreSQL, to use a database cursor, you need to provide a given fetch size, as illustrated by the following example:

try(Stream<Post> postStream = entityManager
    .createQuery(
        "select p " +
        "from Post p " +
        "order by p.createdOn desc", Post.class)
    .setHint( QueryHints.HINT_FETCH_SIZE, 50 )
    .unwrap(Query.class)
    .stream()
) {
    return postStream
    .limit( 50 )
    .collect( 
        Collectors.toList() 
    );
}

Without the QueryHints.HINT_FETCH_SIZE JPA hint, Hibernate uses the default Statement fetch size, which causes the entire ResultSet to be fetched at once on the client side. Note that the PostgreSQL driver only uses a database cursor when the fetch size is positive and the Connection is not in auto-commit mode, which is the case when the query runs inside a transaction.

Performance considerations

Some ideas sound really good in theory but, in reality, they prove to be suboptimal. A while ago, Mark Paluch opened the HHH-11260 issue, which asked Hibernate to automatically set the Integer.MIN_VALUE fetch size whenever calling the stream or scroll Query methods.

However, while running a performance benchmark, we realized that, for MySQL, fetching the whole ResultSet at once is actually more efficient than streaming, as long as we are using ResultSet pagination.

Minding the Execution Plan

Now, a SQL statement is executed in three steps:

  • First, the SQL statement is parsed into an AST and a Query Tree is generated
  • Second, an Execution Plan is generated for the associated Query Tree
  • Third, the Execution Plan is executed and the ResultSet is sent to the client

The first and the second steps can be skipped if the database uses an Execution Plan cache (e.g. Oracle, SQL Server).

For the aforementioned SQL query, the RDBMS has no idea that we might need just a subset of all entries that are scanned by the query. For this reason, even if we have an index on the created_on column, the RDBMS will not use it if the amount of scanned rows is very large.

This is because reading from a non-covering index requires an additional random access read operation for the associated page so that the columns not included in the index can be fetched too. Because random access can become very expensive when the number of operations is high, the RDBMS might choose a sequential scan since sequential reads are faster than random access ones.

So, assuming we have the following database index in place:

CREATE INDEX idx_post_created_on ON post ( created_on DESC )

And, assuming we have 5000 post rows, we can get the Execution Plan for the following streaming query:

List<String> executionPlanLines = doInJPA(entityManager -> {
    try(Stream<String> executionPlanStream = entityManager
        .createNativeQuery(
            "EXPLAIN ANALYZE " +
            "SELECT p " +
            "FROM post p " +
            "ORDER BY p.created_on DESC")
        .setHint( QueryHints.HINT_FETCH_SIZE, 50 )
        .unwrap(Query.class)
        .stream()
    ) {
        // each EXPLAIN ANALYZE row consists of a single String column
        return executionPlanStream.collect( Collectors.toList() );
    }
});

LOGGER.info( "Execution plan: {}",
             executionPlanLines
             .stream()
             .collect( Collectors.joining( "\n" ) )
);

We will get the following Execution Plan:

Execution plan: 
Sort  (cost=65.53..66.83 rows=518 width=564) (actual time=4.339..5.473 rows=5000 loops=1)
  Sort Key: created_on DESC
  Sort Method: quicksort  Memory: 896kB
  ->  Seq Scan on post p  (cost=0.00..42.18 rows=518 width=564) (actual time=0.041..1.833 rows=5000 loops=1)
Planning time: 1.840 ms
Execution time: 6.611 ms

That’s bad! PostgreSQL chose a Full-table sequential scan instead of using the index because it assumed we were going to fetch all rows from the post table.

Instead, if we instruct PostgreSQL that we need just 50 entries:

List<String> executionPlanLines = doInJPA(entityManager -> {
    return entityManager
        .createNativeQuery(
            "EXPLAIN ANALYZE " +
            "SELECT p " +
            "FROM post p " +
            "ORDER BY p.created_on DESC")
        .setMaxResults( 50 )
        .unwrap(Query.class)
        .getResultList();
});

LOGGER.info( "Execution plan: {}",
             executionPlanLines
             .stream()
             .collect( Collectors.joining( "\n" ) )
);

We now get the following Execution Plan:

Execution plan: 
Limit  (cost=0.28..25.35 rows=50 width=564) (actual time=0.038..0.051 rows=50 loops=1)
  ->  Index Scan using idx_post_created_on on post p  (cost=0.28..260.04 rows=518 width=564) (actual time=0.037..0.049 rows=50 loops=1)
Planning time: 1.511 ms
Execution time: 0.148 ms

That’s more like it! Instead of a sequential scan, we now get an Index Scan which goes over 50 rows only.

Without knowing the expected `ResultSet` size in advance, the database will pick an Execution Plan that is suitable for when all entries need to be extracted. Therefore, using streaming via a database cursor just to avoid fetching the whole `ResultSet` might yield a suboptimal Execution Plan.

Streaming and database cursors make a lot of sense inside database stored procedures when we need to process large amounts of data, and we want to reuse the same cursor while batch processing the entire ResultSet.

However, for sending data to a client, pagination is the best way to ensure that we fetch just as much data as we need.


Conclusion

Although the JPA 2.2 query streaming feature aims to prevent you from fetching too much data, it’s much more efficient to use ResultSet pagination instead.

While the default OFFSET pagination, which is implemented by all major RDBMS and even defined by the SQL:2008 standard, is suitable when the query where clause predicates are highly selective, you have other options as well, like Keyset Pagination or even more exotic alternatives.
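
For instance, a minimal Keyset Pagination sketch, assuming we keep track of the createdOn value of the last entry on the previously fetched page:

List<Post> nextPage = entityManager
    .createQuery(
        "select p " +
        "from Post p " +
        "where p.createdOn < :lastSeenCreatedOn " +
        "order by p.createdOn desc", Post.class)
    .setParameter( "lastSeenCreatedOn", lastSeenCreatedOn )
    .setMaxResults( 50 )
    .getResultList();

This way, the database can seek directly into the idx_post_created_on index instead of scanning and discarding the rows of all previous pages.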
