How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once

Introduction

I read a very interesting article by Krešimir Nesek regarding MySQL result set streaming when it comes to reducing memory usage.

Mark Paluch, from Spring Data, asked if we could turn the MySQL result set streaming by default whenever we are using Query#stream or Query#scroll.

That being said, the HHH-11260 issue was created, and I started working on it. During Peer Review, Steve Ebersole (Hibernate ORM team leader) and Sanne Grinovero (Hibernate Search Team Leader) expressed their concerns regarding making such a change.

First of all, the MySQL result set streaming has the following caveats:

  • the ResultSet must be traversed fully before issuing any other SQL statement
  • the statement is not closed if there are still records to be read in the associated ResultSet
  • the locks associated with the underlying SQL statement that is being streamed are released when the transaction is ended (either commit or rollback).

Why streaming?

In the vast majority of situations, you don’t need result set streaming for the following reasons:

  • if you need to process a large volume of data, it’s much more efficient to process it in the database using a stored procedure. This is especially true for Oracle and SQL Server which offer a very solid procedural language.
  • if you need to process the data in the application, then batch processing is the way to go. That being said, you only need to select and process small amounts of data at a time. This allows you to prevent long-running transactions, which are undesirable for both 2PL and MVCC database transactions. By splitting the data set into multiple batches, you can better parallelize the data processing task.

The being said, the only reason you should be using streaming is to restrict the memory allocation on the client side while avoiding to execute an SQL statement for each batch execution.

However, issuing a new statement that fetches the current batch data can be a real advantage because the query can be paginated. If the filtered data set is fairly large, then you should be using Keyset Pagination, as Markus Winand explains in his SQL Performance Explained book. If the result set is not too large, then OFFSET pagination can be a solution as well.

Another great advantage of smaller paginated queries is index selectivity. If the filtered data set is rather large, it might be that you cannot benefit from indexing because the execution plan has decided to sue a sequential scan instead. Therefore the streaming query might be slow.

A paginated query that needs to scan a small data set can better take advantage of a database index because the cost of random access might be lower than the one associated with a sequential scan.

How MySQL streaming performs?

If you’re consuming the whole stream, just as Krešimir Nesek does in his article, then maybe you are better off using batch processing.

Let’s see what’s faster when it comes to consuming the whole ResultSetthe default fetch-all or the streaming alternative.

The default fetch-all is done as follows:

private void stream(EntityManager entityManager) {
    final AtomicLong sum = new AtomicLong();
    try(Stream<Post> postStream = entityManager
        .createQuery("select p from Post p", Post.class)
        .setMaxResults(resultSetSize)
        .unwrap(Query.class)
        .stream()) {
        postStream.forEach(post -> sum.incrementAndGet());
    }
    assertEquals(resultSetSize, sum.get());
}

while the JDBC Driver streaming is done using the org.hibernate.fetchSize Hibernate Query hint:

private void stream(EntityManager entityManager) {
    final AtomicLong sum = new AtomicLong();
    try(Stream<Post> postStream = entityManager
        .createQuery("select p from Post p", Post.class)
        .setMaxResults(resultSetSize)
        .setHint(QueryHints.HINT_FETCH_SIZE, Integer.MIN_VALUE)
        .unwrap(Query.class)
        .stream()) {
        postStream.forEach(post -> sum.incrementAndGet());
    }
    assertEquals(resultSetSize, sum.get());
}

The test does a warm-up of 25 000 method calls, and then it executes the stream method 10 000 times while measuring the fetch time using Dropwizard Metrics.

On the y-axis, the diagram shows the 98th percentile that was recorded by the Dropwizard Timer when consuming the whole ResultSet.
On the x-axis, the resultSetSize varies from 1, 2, 5, up to higher values (e.g. 5000).

mysqlstreaming

The response time grows exponentially with result set size. Therefore, in OLTP applications, you should always strive for keeping the JDBC ResultSet as small as possible. That’s why batch processing and pagination queries are usually a better alternative than streaming a large result set.

Code available on GitHub.

If you enjoyed this article, I bet you are going to love my book as well.

Conclusion

Steve and Sanne assumptions turned out to be right. Streaming performs worse than just fetching the whole ResultSet at once, which is the default strategy for both MySQL and PostgreSQL JDBC drivers.

Therefore, it’s not advisable to make the change proposed by the HHH-11260 Jira issue. That being said, it’s up to you to decide if streaming makes sense for your use case, or whether you should be using batch processing with paginated queries.

Enter your email address to follow this blog and receive notifications of new posts by email.

Advertisements

20 thoughts on “How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once

  1. Oh, that one is easy to fix! Instead of:

    private void stream(EntityManager entityManager) {
    final AtomicLong sum = new AtomicLong();
    try(Stream postStream = entityManager
    .createQuery(“select p from Post p”, Post.class)
    .setMaxResults(resultSetSize)
    .unwrap(Query.class)
    .stream()) {
    postStream.forEach(post -> sum.incrementAndGet());
    }
    assertEquals(resultSetSize, sum.get());
    }

    Just write:

    SELECT count(*) FROM post

    Tons of performance improvement… Cheers 😉

    1. True that! The query is meant to consume the stream, and it’s meant for the exercise sake.

      However, I already mentioned that processing data in the application is a good alternative to fetching tons of data from the DB to the application, just to make the processing in Java.

      I just wanted to raise awareness that streaming has its price in MySQL.

  2. Great tip!

    I think I just used result set streaming (via Hibernate ScrollableResults) once during all my life. That was necessary to generate a PDF report which in case needed LARGE amount of data.

    In batch processing I usually favor paginated queries!

  3. Hi Vladimir, nice article.
    After reading i wonder if it is posible using hibernate to tell the driver to use java.sql.ResultSet.TYPE_FORWARD_ONLY anda java.sql.ResultSet.CONCUR_READ_ONLY as is recommended by Krešimir in that post to improve streaming performance from database.
    I have checked your code and i have not found that piece of code.

    Thanks!

      1. I’d wait for the MariaDB Driver to fix this issue, and then validate it against Hibernate streaming. If it still doesn’t work, then we can investigate if we also have a bug in Hibernate core.

    1. Thanks, I fixed the Jira issue link issue. As for your statement about auto-commit, that’s not the case here. Hibernate disables the auto-commit for RESOURCE_LOCAL transactions anyway.

  4. I find this post to be a little misleading. The conclusion does align with the title of the post but the entire “Why Streaming” section seems out of place. In this section you mention the some of the pros and cons of the streaming result set but none of them are actually shown in the tests provided. Specifically you mention streaming result sets should only be used when you are iterating an entire result set and have memory constraints. Neither of which are reflected in the tests. It would be nice to see a comparison between streaming result set vs pagination with limit / offset vs pagination with keyset pagination. Also, it looks like the response time graph has linear growth not exponential, it just appears that way because of the scale of the x axis. 1000 looks like it takes half the time as 2000 and the same for 2500 and 5000. This would make sense because its returning twice the data.

      1. It looks like the other post just references back to this post for the performance which isn’t a fair comparison. These unit tests are basically comparing select * from posts limit N. A better comparison between pagination and streaming result sets would be select * from posts with streaming vs select * from posts with a page size of N. Every test should return all 5000 records which is currently not happening only the first N records are fetched. In the unit tests you can see assertEquals(resultSetSize, sum.get());.

        If you are only looking for a page or two of data you wouldn’t choose a streaming result set. But if you want to process ALL rows in a table its not a bad choice. It’s faster than pagination with limit / offset as shown in Krešimir Nesek’s article. I would guess its also faster or comparable to keyset pagination. Now the benefit of keyset pagination would be you can parallelize the queries for added performance but just comparing a single thread I would wager that a streaming result set would be faster.

        Running the same queries with pagination / keyset pagination and streaming where all 5000 records are returned in every test for varying page sizes would be a more accurate comparison of pagination vs streaming. The streaming result set would probably be constant across all the tests but the pagination tests might change. This is one of the only use cases where a streaming result set would be beneficial.

      2. The tests are on GitHub, so feel free to provide a Pull Request that backs up your claims.

        The goal of this post is to prove that fetching the entire result set at once is much more efficient than doing it in multiple roundtrips.

        As for pagination vs streaming, the execution plan on the other post proves my point.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s