The best way to do batch processing with JPA and Hibernate


Introduction

Recently, one of my followers asked me to answer a question on Quora about batch processing, and, since the question was really interesting, I decided to turn it into a blog post.

In this article, you are going to find out what batch processing is, why we use it, and how to use it properly with JPA and Hibernate.

Batch processing

When writing an enterprise application, it’s common to split your work between the front-end system, which serves typical OLTP (Online Transaction Processing) traffic, and one or more batch processors, which are used for ETL (Extract, Transform, Load) purposes.

The batch processor, as its name implies, splits the data to be processed into several chunks, therefore offering the following advantages:

  • each chunk can be processed by a separate worker thread, increasing throughput and reducing the overall processing time
  • each chunk can use its own database transaction, so, if there is a failure, we don’t have to throw away all the work we have done, only the current transaction changes
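The chunking idea above can be sketched in plain Java. This is just an illustration, not part of any framework; the Chunks class name and the chunkSize parameter are mine:

```java
import java.util.ArrayList;
import java.util.List;

public class Chunks {

    // Split the records into fixed-size chunks so that each chunk
    // can be handed to its own worker thread and database transaction.
    public static <T> List<List<T>> split(List<T> records, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += chunkSize) {
            chunks.add(
                records.subList(i, Math.min(i + chunkSize, records.size()))
            );
        }
        return chunks;
    }
}
```

Splitting 50 records with a chunk size of 25 yields two chunks of 25 records each, which each worker can then process in its own transaction.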

JPA batch processing

When using JPA, assuming you want to insert 50 Post entities, this is how you should do it:

int entityCount = 50;
int batchSize = 25;

EntityManager entityManager = entityManagerFactory()
    .createEntityManager();
    
EntityTransaction entityTransaction = entityManager
    .getTransaction();

try {
    entityTransaction.begin();

    for (int i = 0; i < entityCount; i++) {
        if (i > 0 && i % batchSize == 0) {
            entityTransaction.commit();
            entityTransaction.begin();

            entityManager.clear();
        }

        Post post = new Post(
            String.format("Post %d", i + 1)
        );
        
        entityManager.persist(post);
    }

    entityTransaction.commit();
} catch (RuntimeException e) {
    if (entityTransaction.isActive()) {
        entityTransaction.rollback();
    }
    throw e;
} finally {
    entityManager.close();
}

A transaction is started from the very beginning since every entity state transition must execute within the scope of a database transaction.

The for loop persists one Post at a time. However, since entity state transitions are only executed at flush time, we can group several SQL INSERT statements into a single PreparedStatement execution that takes multiple parameter sets.

Every time the iteration counter (i.e. i) reaches a multiple of the batchSize threshold, we flush the EntityManager and commit the database transaction. By committing the database transaction after every batch execution, we gain the following advantages:

  1. We avoid long-running transactions which are detrimental to MVCC relational database systems.
  2. We make sure that if we have a failure, we don’t lose the work done by the batch jobs that have previously executed successfully.
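To see exactly when the `i > 0 && i % batchSize == 0` check triggers a commit-and-clear, here is a standalone sketch of the loop’s schedule (no database involved; the FlushSchedule class name is mine):

```java
import java.util.ArrayList;
import java.util.List;

public class FlushSchedule {

    // Returns the iteration indexes at which the loop above commits the
    // transaction and clears the EntityManager: right before persisting
    // the (batchSize + 1)-th, (2 * batchSize + 1)-th, ... entity.
    public static List<Integer> flushPoints(int entityCount, int batchSize) {
        List<Integer> flushPoints = new ArrayList<>();
        for (int i = 0; i < entityCount; i++) {
            if (i > 0 && i % batchSize == 0) {
                flushPoints.add(i);
            }
        }
        return flushPoints;
    }
}
```

For entityCount = 50 and batchSize = 25, the only intermediate commit happens at i = 25, before the 26th entity is persisted, so each transaction groups exactly 25 entities.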

The EntityManager is cleared after every batch execution so that we don’t keep accumulating managed entities, which can cause several problems:

  1. If the number of entities to be persisted is huge, we risk running out of memory.
  2. The more entities we accumulate in the Persistence Context, the slower the flush becomes. So, it’s good practice to keep the Persistence Context as slim as possible.

If an exception is thrown, we must make sure to roll back the currently running database transaction. Failing to do so can cause many problems, since the database might still consider the transaction open, and locks might be held until the transaction is ended by a timeout or by the DBA.

In the end, we need to close the EntityManager so that we can clear the context and deallocate Session-level resources.

While this is the proper way to do batch processing with JPA, we are not done yet. As previously explained, we can also benefit from JDBC batch updates. To do so, we need to provide the following Hibernate configuration properties:

<property
    name="hibernate.jdbc.batch_size" 
    value="25"
/>

<property
    name="hibernate.order_inserts"   
    value="true"
/>

<property
    name="hibernate.order_updates"   
    value="true"
/>

These properties allow us to batch multiple SQL statements into a single PreparedStatement execution which requires a single database roundtrip. The value of 25 was chosen to match the EntityManager batch job threshold.
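If you bootstrap the EntityManagerFactory programmatically rather than via persistence.xml, the same settings can be supplied as a property map passed to Persistence.createEntityManagerFactory(String, Map). A minimal sketch (the persistence-unit name would be your own):

```java
import java.util.Map;

public class BatchConfig {

    // The same Hibernate settings as above, as a property map that can be
    // passed to Persistence.createEntityManagerFactory("my-unit", props).
    public static Map<String, String> jdbcBatchProperties() {
        return Map.of(
            "hibernate.jdbc.batch_size", "25",
            "hibernate.order_inserts", "true",
            "hibernate.order_updates", "true"
        );
    }
}
```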

When executing the previous test case, Hibernate generates just 2 SQL INSERT statements:

Query: ["INSERT INTO post (title, id) values (?, ?)"],
Params:[
    (Post 1, 1), (Post 2, 2), (Post 3, 3), 
    (Post 4, 4), (Post 5, 5), (Post 6, 6), 
    (Post 7, 7), (Post 8, 8), (Post 9, 9), 
    (Post 10, 10), (Post 11, 11), (Post 12, 12), 
    (Post 13, 13), (Post 14, 14), (Post 15, 15), 
    (Post 16, 16), (Post 17, 17), (Post 18, 18), 
    (Post 19, 19), (Post 20, 20), (Post 21, 21), 
    (Post 22, 22), (Post 23, 23), (Post 24, 24), 
    (Post 25, 25)
]

Query: ["INSERT INTO post (title, id) values (?, ?)"],
Params:[
    (Post 26, 26), (Post 27, 27), (Post 28, 28), 
    (Post 29, 29), (Post 30, 30), (Post 31, 31), 
    (Post 32, 32), (Post 33, 33), (Post 34, 34), 
    (Post 35, 35), (Post 36, 36), (Post 37, 37), 
    (Post 38, 38), (Post 39, 39), (Post 40, 40), 
    (Post 41, 41), (Post 42, 42), (Post 43, 43), 
    (Post 44, 44), (Post 45, 45), (Post 46, 46), 
    (Post 47, 47), (Post 48, 48), (Post 49, 49), 
    (Post 50, 50)
]

Brilliant!

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.


Conclusion

Knowing how to properly design a batch processing job is very important when developing an enterprise application. Luckily, with JPA and Hibernate, this task is very easy to implement. The most important aspect is to take into consideration how a relational database works best, and that’s what should drive your data access design decisions.


12 Comments on “The best way to do batch processing with JPA and Hibernate”

  1. I think (not entirely sure) that you left out one key assumption in this article. I’m pretty sure that, in order for this to work, you need resource-local EntityManagers. If you’re using JTA instead (where the EntityManager is managed by the container and you inject it via @PersistenceContext instead of acquiring it from an EntityManagerFactory), this won’t work. Please correct me if I’m wrong.

    • This would work just fine with either RL or JTA. Of course, it doesn’t make much sense to do batch processing in the context of an OLTP transactional service method where you’d typically have the EM injected and wrapped by the TransactionalInterceptor. Batch processing works best with Enterprise Integration Patterns designs like Apache Camel or Spring Integration.

  2. Hi, Vlad!

    I think there might be an issue in this code block:

    for (int i = 0; i < entityCount; i++) {
            if (i > 0 && i % batchSize == 0) {
    

    Because, for the 1st iteration, i == 0 and when i == 25, there will be 26 items in the batch.

    • The logic is fine as proven by the associated test case in my High-Performance Java Persistence. So, it’s not 26, but 25, as the flush happens at the beginning of the next iteration, prior to persisting the 26th item.

  3. Does it mean you don’t use Spring’s @Transactional at the service class in the case of batch processing?
    Thanks

    • If you use Enterprise Integration Patterns, then the persistence components can use @Transactional just fine.

  4. Great article. Do you have any advice or thoughts on a batch process that first reads many records with joins? Say, for instance, that before inserting any new Post instance, we first query another table: “Select from PostLegacy p join p.user u join u.interests interest”, and suppose PostLegacy has millions of records. If we add a paging strategy (offset, resultQuantity) directly in the query, do the first pages perform the same as the last ones? I’m thinking of how the DB gets to the results at a specific offset. Does it have to recreate and traverse the whole resultset up to that position? In that case, would it be fair to consider this a bad case for paging? What would you recommend? Thanks in advance. Following you since the early days of jOOQ 🙂

  5. I’m currently having a problem that I wish to mention.

    Let’s say these many posts are created by a (single) user, so the user is a ManyToOne in Post and User has a collection of Posts. Do I need to retrieve the User at the beginning of the transaction each time, to set it on the new Post and call user.addPost(post)?

    Or is there a way to get the User at the beginning of the Batch job only?

    Perhaps there’s a way to just put an FK on the Post and not bother with a User entity at all?

  6. “Recently, on of my followers”, typo, needs to be “one” instead of “on”
