The best way to do batch processing with JPA and Hibernate

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.

So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!

Introduction

Recently, one of my followers asked me to answer a question on Quora about batch processing, and, since the question was really interesting, I decided to turn it into a blog post.

In this article, you are going to find out what batch processing is, why do we use it, and how to use it properly with JPA and Hibernate.

Batch processing

When writing an enterprise application, it’s common to split your work between the front-end system which serves a typical OLTP (Online Transaction Processing) traffic, and one or more batch processors which are used for ETL (Extract, Transform, Load) purposes.

The batch processor, as its name implies, splits the data to be processed into several chunks, therefore offering the following advantages:

  • each chunk can be processed by a separate working thread, therefore increasing throughput and reducing the overall processing time
  • each chunk can use its own database transaction, so, if there is a failure, we don’t have to throw away all the work we have done, just the current transaction changes

JPA batch processing

When using JPA, assuming you want to insert 50 Post entities, this is how you should do it:

int entityCount = 50;
int batchSize = 25;

EntityManager entityManager = entityManagerFactory()
    .createEntityManager();
    
EntityTransaction entityTransaction = entityManager
    .getTransaction();

try {
    entityTransaction.begin();

    for (int i = 0; i < entityCount; i++) {
        if (i > 0 && i % batchSize == 0) {
            entityTransaction.commit();
            entityTransaction.begin();

            entityManager.clear();
        }

        Post post = new Post(
            String.format("Post %d", i + 1)
        );
        
        entityManager.persist(post);
    }

    entityTransaction.commit();
} catch (RuntimeException e) {
    if (entityTransaction.isActive()) {
        entityTransaction.rollback();
    }
    throw e;
} finally {
    entityManager.close();
}

A transaction is started from the very beginning since every entity state transition must execute within the scope of a database transaction.

The for loop persists one Post at a time. However, since entity state transitions are only executed at flush time, we can group several SQL INSERT statements into a single PreparedStatement execution that takes multiple parameter sets.

Every time iteration counter (e.g. i) has reached a multiple of the batchSize threshold value, we can flush the EntityManager and commit the database transaction. By committing the database transaction after every batch execution, we gain the following advantages:

  1. We avoid long-running transactions which are detrimental to MVCC relational database systems.
  2. We make sure that if we have a failure, we don’t lose the work done by the batch jobs that have previously executed successfully.

The EntityManager is cleared after every batch execution so that we don’t keep on accumulating managed entities which can cause several problems:

  1. If the number of entities to be persisted is huge, we risk of running out of memory.
  2. The more entities we accumulate in the Persistence Context, the slower the flush becomes. So, it’s good practice to make sure the Persistence Context as slim as possible.

If an exception is thrown, we must make sure to roll back the current running database transaction. Failing to do so, can cause many problems since the database might still think that the transaction is open and locks might be held until the transaction is ended by a timeout or by the DBA.

In the end, we need to close the EntityManager so that we can clear the context and deallocate Session-level resources.

While this is the proper way to do batch processing with JPA, we are not done yet. As previously explained, we can also benefit from JDBC batch updates. To do so, we need to provide the following Hibernate configuration properties:

<property
    name="hibernate.jdbc.batch_size" 
    value="25"
/>

<property
    name="hibernate.order_inserts"   
    value="true"
/>

<property
    name="hibernate.order_updates"   
    value="true"
/>

These properties allow us to batch multiple SQL statements into a single PreparedStatement execution which requires a single database roundtrip. The value of 25 was chosen to match the EntityManager batch job threshold.

When executing the previous test case, Hibernate generates just 2 SQL INSERT statements:

INSERT INTO post (title, id) values (?, ?)"], 
Params:[
    (Post 1, 1), (Post 2, 2), (Post 3, 3), 
    (Post 4, 4), (Post 5, 5), (Post 6, 6), 
    (Post 7, 7), (Post 8, 8), (Post 9, 9), 
    (Post 10, 10), (Post 11, 11), (Post 12, 12), 
    (Post 13, 13), (Post 14, 14), (Post 15, 15), 
    (Post 16, 16), (Post 17, 17), (Post 18, 18), 
    (Post 19, 19), (Post 20, 20), (Post 21, 21), 
    (Post 22, 22), (Post 23, 23), (Post 24, 24), 
    (Post 25, 25)
]

INSERT INTO post (title, id) values (?, ?)"], 
Params:[
    (Post 26, 26), (Post 27, 27), (Post 28, 28), 
    (Post 29, 29), (Post 30, 30), (Post 31, 31), 
    (Post 32, 32), (Post 33, 33), (Post 34, 34), 
    (Post 35, 35), (Post 36, 36), (Post 37, 37), 
    (Post 38, 38), (Post 39, 39), (Post 40, 40), 
    (Post 41, 41), (Post 42, 42), (Post 43, 43), 
    (Post 44, 44), (Post 45, 45), (Post 46, 46), 
    (Post 47, 47), (Post 48, 48), (Post 49, 49), 
    (Post 50, 50)
]

Brilliant!

I'm running an online workshop on the 11th of October about High-Performance SQL.

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

Knowing how to properly design a batch processing job is very important when developing an enterprise application. Luckily, with JPA and Hibernate, this task is very easy to implement. The most important aspect is to take into consideration how a relational database works best, and that’s what should drive your data access design decisions.

Transactions and Concurrency Control eBook

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.