How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once

(Last Updated On: May 22, 2018)

Introduction

I read a very interesting article by Krešimir Nesek regarding MySQL result set streaming when it comes to reducing memory usage.

Mark Paluch, from Spring Data, asked whether we could turn on MySQL result set streaming by default whenever Query#stream or Query#scroll is used.

That being said, the HHH-11260 issue was created, and I started working on it. During peer review, Steve Ebersole (Hibernate ORM team leader) and Sanne Grinovero (Hibernate Search team leader) expressed their concerns about making such a change.

First of all, MySQL result set streaming has the following caveats:

  • the ResultSet must be traversed fully before issuing any other SQL statement
  • the statement cannot be closed as long as there are records still to be read in the associated ResultSet
  • the locks associated with the underlying SQL statement that is being streamed are released when the transaction is ended (either commit or rollback).
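At the plain JDBC level, this streaming mode is enabled by setting the fetch size to Integer.MIN_VALUE on a forward-only, read-only Statement, which is the MySQL Connector/J convention for row-by-row streaming. A minimal sketch (the table and column names are hypothetical):

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class MySqlStreamingExample {

    // MySQL Connector/J switches to row-by-row streaming only when the
    // fetch size is exactly Integer.MIN_VALUE on a forward-only,
    // read-only statement.
    static final int STREAM_FETCH_SIZE = Integer.MIN_VALUE;

    public static void streamPosts(Connection connection) throws SQLException {
        try (Statement statement = connection.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            statement.setFetchSize(STREAM_FETCH_SIZE);
            try (ResultSet rs = statement.executeQuery(
                    "SELECT id, title FROM post")) {
                while (rs.next()) {
                    // rows are pulled from the socket as you iterate,
                    // not buffered up front on the client
                }
            }
        }
    }
}
```

Note that, per the caveats above, no other statement may run on this connection until the ResultSet is fully traversed or closed.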

Why streaming?

In the vast majority of situations, you don’t need result set streaming for the following reasons:

  • if you need to process a large volume of data, it’s much more efficient to process it in the database using a stored procedure. This is especially true for Oracle and SQL Server which offer a very solid procedural language.
  • if you need to process the data in the application, then batch processing is the way to go. That being said, you only need to select and process small amounts of data at a time. This allows you to prevent long-running transactions, which are undesirable for both 2PL and MVCC database transactions. By splitting the data set into multiple batches, you can better parallelize the data processing task.
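The batch-processing loop described above can be sketched generically. Here, fetchPage is a hypothetical page source (for example, a paginated query running in its own short transaction) that returns an empty list once the data set is exhausted:

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.IntFunction;

public class BatchProcessor {

    /**
     * Processes a large data set in small batches, so that no single
     * transaction has to span the whole data set.
     *
     * @param fetchPage returns the batch for a given page number,
     *                  or an empty list when there is no more data
     * @param processor the per-record processing logic
     * @return the total number of records processed
     */
    public static <T> long processInBatches(
            IntFunction<List<T>> fetchPage,
            Consumer<T> processor) {
        long processed = 0;
        int page = 0;
        List<T> batch;
        while (!(batch = fetchPage.apply(page++)).isEmpty()) {
            batch.forEach(processor); // each batch = one short unit of work
            processed += batch.size();
        }
        return processed;
    }
}
```

Because each page is fetched independently, the pages can also be handed to separate workers to parallelize the processing task.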

That being said, the only reason to use streaming is to restrict the memory allocation on the client side while avoiding the execution of a new SQL statement for each batch.

However, issuing a new statement that fetches the current batch data can be a real advantage because the query can be paginated. If the filtered data set is fairly large, then you should be using Keyset Pagination, as Markus Winand explains in his SQL Performance Explained book. If the result set is not too large, then OFFSET pagination can be a solution as well.
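The Keyset Pagination idea can be sketched in memory: instead of skipping OFFSET rows, each page filters on the last key seen (in SQL terms, WHERE id > :lastId ORDER BY id LIMIT :pageSize), so the database can seek straight into the index. The NavigableSet below is just a stand-in for an indexed id column:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.stream.Collectors;

public class KeysetPagination {

    // Stand-in for the indexed seek: returns the next pageSize ids
    // strictly greater than lastId, in ascending order.
    static List<Long> nextPage(NavigableSet<Long> ids, long lastId, int pageSize) {
        return ids.tailSet(lastId, false).stream()
                  .limit(pageSize)
                  .collect(Collectors.toList());
    }

    // Walks the whole data set one keyset page at a time.
    static List<Long> fetchAll(NavigableSet<Long> ids, int pageSize) {
        List<Long> all = new ArrayList<>();
        long lastId = Long.MIN_VALUE;
        List<Long> page;
        while (!(page = nextPage(ids, lastId, pageSize)).isEmpty()) {
            all.addAll(page);
            lastId = page.get(page.size() - 1); // carry the key forward
        }
        return all;
    }
}
```

Unlike OFFSET, the cost of fetching a page stays constant no matter how deep into the data set you are, which is exactly the advantage Markus Winand describes.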

Another great advantage of smaller paginated queries is index selectivity. If the filtered data set is rather large, you might not benefit from indexing because the execution plan has decided to use a sequential scan instead. Therefore, the streaming query might be slow.

A paginated query that needs to scan a small data set can better take advantage of a database index because the cost of random access might be lower than the one associated with a sequential scan.

How does MySQL streaming perform?

If you’re consuming the whole stream, just as Krešimir Nesek does in his article, then maybe you are better off using batch processing.

Let’s see what’s faster when it comes to consuming the whole ResultSet: the default fetch-all mode or the streaming alternative.

The default fetch-all is done as follows:

private void stream(EntityManager entityManager) {
    final AtomicLong sum = new AtomicLong();
    // no fetch-size hint: the MySQL driver fetches the whole
    // ResultSet at once and buffers it on the client side
    try (Stream<Post> postStream = entityManager
        .createQuery("select p from Post p", Post.class)
        .setMaxResults(resultSetSize)
        .unwrap(Query.class)
        .stream()) {
        postStream.forEach(post -> sum.incrementAndGet());
    }
    assertEquals(resultSetSize, sum.get());
}

while the JDBC Driver streaming is done using the org.hibernate.fetchSize Hibernate Query hint:

private void stream(EntityManager entityManager) {
    final AtomicLong sum = new AtomicLong();
    // Integer.MIN_VALUE instructs the MySQL driver
    // to stream the ResultSet row by row
    try (Stream<Post> postStream = entityManager
        .createQuery("select p from Post p", Post.class)
        .setMaxResults(resultSetSize)
        .setHint(QueryHints.HINT_FETCH_SIZE, Integer.MIN_VALUE)
        .unwrap(Query.class)
        .stream()) {
        postStream.forEach(post -> sum.incrementAndGet());
    }
    assertEquals(resultSetSize, sum.get());
}

The test does a warm-up of 25 000 method calls, and then it executes the stream method 10 000 times while measuring the fetch time using Dropwizard Metrics.
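The shape of such a timing harness can be sketched without the Dropwizard dependency; here, System.nanoTime and a sorted sample list stand in for the Dropwizard Timer, and the 98th percentile is read off the sorted samples via the nearest-rank method:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PercentileHarness {

    // Runs the task repeatedly, recording each measured duration in nanoseconds
    static List<Long> measure(Runnable task, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run(); // warm-up runs, timings discarded
        }
        List<Long> samples = new ArrayList<>(measuredRuns);
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            task.run();
            samples.add(System.nanoTime() - start);
        }
        return samples;
    }

    // Nearest-rank percentile (p in [0, 100]) over the recorded samples
    static long percentile(List<Long> samples, double p) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank - 1, 0));
    }
}
```

A real Dropwizard Timer additionally uses reservoir sampling and handles concurrency, so treat this only as an illustration of the warm-up/measure/percentile structure.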

On the y-axis, the diagram shows the 98th percentile that was recorded by the Dropwizard Timer when consuming the whole ResultSet.
On the x-axis, the resultSetSize varies from 1, 2, 5, up to higher values (e.g. 5000).

[Figure: 98th percentile ResultSet fetch time for the default fetch-all mode vs MySQL streaming, as the result set size grows]

The response time grows with result set size. Therefore, in OLTP applications, you should always strive for keeping the JDBC ResultSet as small as possible. That’s why batch processing and pagination queries are usually a better alternative than streaming a large result set.

Code available on GitHub.

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

Steve’s and Sanne’s assumptions turned out to be right. Streaming performs worse than just fetching the whole ResultSet at once, which is the default strategy for both the MySQL and PostgreSQL JDBC drivers.

Therefore, it’s not advisable to make the change proposed by the HHH-11260 Jira issue. That being said, it’s up to you to decide if streaming makes sense for your use case, or whether you should be using batch processing with paginated queries.



8 thoughts on “How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once”

  1. Hi Vlad! Thank you for good reading!

    But I think you are comparing apples with oranges.

    Data streaming is used not for performance but for smaller memory usage. Of course, the performance of streaming data over the network will always be lower than fetching it all at once. I can’t imagine a case when this would not be true.

    So, if you don’t care about memory consumption (read: all the result data fits into memory), then of course streaming is not required and should be avoided, because it is just slower.

    But when you really need to process a large amount of data, and you don’t care too much about speed, but do care about memory (you don’t want OutOfMemory errors in the middle of processing, or even near the end of it), then streaming is a good option. Batching can be useful, but it is usually harder to implement when you just want a fast solution for a one-time task.

    So, I don’t understand your conclusion and decision about HHH-11260. If someone wants to fetch all the data at once, why would they use a streaming API in the first place? Just use a List then.

    But when someone decides to use a Stream, their expectation is that the data will be streamed all the way down, up to the DB connection. If not, then this is just an illusion of stream processing. An explicit list.stream() (if someone really needs the Stream API just for the convenient look of the code) is better in this case, instead of misleading future code readers, who might think that there is really a stream of data between the DB server and the consuming code.

    1. You almost never need streaming. You either have to process data on the DB and save tons of networking IO or you need pagination.

      1. You are totally right! And I agree with you on this point. But when you really need it, you expect the stream to work (i.e., not to load all the data at once). And only then should you use Stream as the return type.

        So I don’t see why “it’s not advisable to make the change proposed by the HHH-11260 Jira issue”. Doesn’t this change affect only that case?

      2. Because this is a terrible default. You can simply supply the fetch size at query time using a JPA hint.

  2. “The response time grows exponentially with result set size”
    Why do you think so?
    There must be a bug somewhere if the response time grows exponentially with result set size. Proper growth should be linear.

    1. You’re right. I removed the exponentially term since the graph does not have a linear scale.

      1. “The response time grows with result set size. Therefore, in OLTP applications, you should always strive for keeping the JDBC ResultSet as small as possible”

        What “therefore” means there? What should I do if I need to fetch 100 rows? Should I artificially split that into 10 requests of 10 rows each?
        Should I always use 1 row per statement?
        Should fetch size be 1 or 100?

      2. Fetching 100 records is fast. Fetching 100k records is not, which is what you get from the findAll methods exposed by Spring Data repositories. This conclusion is also based on the fact that instructing the DB about the result set size hints the optimizer to choose an index over a full-table scan. Now, the fetch size should be neither 1 nor 10. For instance, 10 is what the Oracle JDBC driver uses by default, which is way too low. However, if you keep your transactions reasonably small, it’s best to just fetch the 50, 100, or 200 records in a single round-trip instead of using multiple fetch calls.
