The best way to do batch processing with JPA and Hibernate

Introduction

Recently, one of my followers asked me to answer a question on Quora about batch processing, and, since the question was really interesting, I decided to turn it into a blog post.

In this article, you are going to find out what batch processing is, why we use it, and how to use it properly with JPA and Hibernate.

Batch processing

When writing an enterprise application, it’s common to split your work between the front-end system, which serves typical OLTP (Online Transaction Processing) traffic, and one or more batch processors, which are used for ETL (Extract, Transform, Load) purposes.

The batch processor, as its name implies, splits the data to be processed into several chunks, therefore offering the following advantages:

  • each chunk can be processed by a separate working thread, therefore increasing throughput and reducing the overall processing time
  • each chunk can use its own database transaction, so, if there is a failure, we don’t have to throw away all the work we have done, just the current transaction changes
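To make these two points concrete, here is a minimal sketch, not taken from the original example, of a chunk-per-thread layout. The ChunkProcessor class, the pool size of 4, and the partitioned List<List<Post>> input are illustrative assumptions; each worker thread opens its own EntityManager and runs its own transaction, so a failure only rolls back the current chunk.

import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.EntityTransaction;

public class ChunkProcessor {

    private final EntityManagerFactory entityManagerFactory;

    public ChunkProcessor(EntityManagerFactory entityManagerFactory) {
        this.entityManagerFactory = entityManagerFactory;
    }

    // Each chunk is processed by a separate worker thread,
    // and each chunk uses its own database transaction.
    public void process(List<List<Post>> chunks) throws InterruptedException {
        ExecutorService executor = Executors.newFixedThreadPool(4);
        for (List<Post> chunk : chunks) {
            executor.submit(() -> persistChunk(chunk));
        }
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.HOURS);
    }

    private void persistChunk(List<Post> chunk) {
        EntityManager entityManager = entityManagerFactory.createEntityManager();
        EntityTransaction transaction = entityManager.getTransaction();
        try {
            transaction.begin();
            chunk.forEach(entityManager::persist);
            transaction.commit();
        } catch (RuntimeException e) {
            if (transaction.isActive()) {
                transaction.rollback();
            }
            throw e;
        } finally {
            entityManager.close();
        }
    }
}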

JPA batch processing

When using JPA, assuming you want to insert 50 Post entities, this is how you should do it:

int entityCount = 50;
int batchSize = 25;

EntityManager entityManager = null;
EntityTransaction transaction = null;

try {
    entityManager = entityManagerFactory()
        .createEntityManager();

    transaction = entityManager.getTransaction();
    transaction.begin();

    for ( int i = 0; i < entityCount; ++i ) {
        if ( i > 0 && i % batchSize == 0 ) {
            entityManager.flush();
            entityManager.clear();

            transaction.commit();
            transaction.begin();
        }

        Post post = new Post( 
            String.format( "Post %d", i + 1 ) 
        );
        entityManager.persist( post );
    }

    transaction.commit();
} catch (RuntimeException e) {
    if ( transaction != null && 
         transaction.isActive()) {
        transaction.rollback();
    }
    throw e;
} finally {
    if (entityManager != null) {
        entityManager.close();
    }
}

A transaction is started from the very beginning since every entity state transition must execute within the scope of a database transaction.

The for loop persists one Post at a time. However, since entity state transitions are only executed at flush time, we can group several SQL INSERT statements into a single PreparedStatement execution that takes multiple parameter sets.
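To see what that grouping amounts to at the driver level, here is a hand-written JDBC sketch, an illustration rather than Hibernate’s internal code, assuming an already open java.sql.Connection and the post table used throughout this article:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class JdbcBatchSketch {

    // One PreparedStatement, many parameter sets, a single executeBatch call:
    // roughly what a multi-row batched INSERT looks like in plain JDBC.
    // Hibernate does something similar at flush time once JDBC batching
    // is enabled, as shown later in this article.
    public static void insertPosts(Connection connection, int count)
            throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(
                "INSERT INTO post (title, id) VALUES (?, ?)")) {
            for (int i = 0; i < count; i++) {
                statement.setString(1, String.format("Post %d", i + 1));
                statement.setLong(2, i + 1);
                statement.addBatch();   // queue the current parameter set
            }
            statement.executeBatch();   // send all queued rows in one roundtrip
        }
    }
}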

Every time the iteration counter (e.g. i) reaches a multiple of the batchSize threshold, we flush the EntityManager and commit the database transaction. By committing the database transaction after every batch execution, we gain the following advantages:

  1. We avoid long-running transactions which are detrimental to MVCC relational database systems.
  2. We make sure that if we have a failure, we don’t lose the work done by the batch jobs that have previously executed successfully.

The EntityManager is cleared after every batch execution so that we don’t keep on accumulating managed entities which can cause several problems:

  1. If the number of entities to be persisted is huge, we risk running out of memory.
  2. The more entities we accumulate in the Persistence Context, the slower the flush becomes. So, it’s good practice to keep the Persistence Context as slim as possible.

If an exception is thrown, we must make sure to roll back the currently running database transaction. Failing to do so can cause many problems, since the database might still think that the transaction is open and locks might be held until the transaction is ended by a timeout or by the DBA.

Finally, we need to close the EntityManager so that we can clear the Persistence Context and deallocate Session-level resources.

While this is the proper way to do batch processing with JPA, we are not done yet. As previously explained, we can also benefit from JDBC batch updates. To do so, we need to provide the following Hibernate configuration properties:

<property
    name="hibernate.jdbc.batch_size" 
    value="25"
/>

<property
    name="hibernate.order_inserts"   
    value="true"
/>

<property
    name="hibernate.order_updates"   
    value="true"
/>

These properties allow us to batch multiple SQL statements into a single PreparedStatement execution which requires a single database roundtrip. The value of 25 was chosen to match the EntityManager batch job threshold.
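If you bootstrap the EntityManagerFactory programmatically, the same settings can also be supplied as a properties Map. For illustration, this is one way the entityManagerFactory() helper used earlier could be wired up; the postPU persistence unit name is a placeholder, not taken from the article:

import java.util.HashMap;
import java.util.Map;

import javax.persistence.EntityManagerFactory;
import javax.persistence.Persistence;

public class BatchingBootstrap {

    // The same Hibernate settings as above, passed programmatically
    // instead of being declared in persistence.xml.
    public static EntityManagerFactory entityManagerFactory() {
        Map<String, Object> properties = new HashMap<>();
        properties.put("hibernate.jdbc.batch_size", "25");
        properties.put("hibernate.order_inserts", "true");
        properties.put("hibernate.order_updates", "true");

        return Persistence.createEntityManagerFactory("postPU", properties);
    }
}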

When executing the previous test case, Hibernate groups the 50 inserts into just two batched SQL INSERT statements:

INSERT INTO post (title, id) values (?, ?)"], 
Params:[
    (Post 1, 1), (Post 2, 2), (Post 3, 3), 
    (Post 4, 4), (Post 5, 5), (Post 6, 6), 
    (Post 7, 7), (Post 8, 8), (Post 9, 9), 
    (Post 10, 10), (Post 11, 11), (Post 12, 12), 
    (Post 13, 13), (Post 14, 14), (Post 15, 15), 
    (Post 16, 16), (Post 17, 17), (Post 18, 18), 
    (Post 19, 19), (Post 20, 20), (Post 21, 21), 
    (Post 22, 22), (Post 23, 23), (Post 24, 24), 
    (Post 25, 25)
]

INSERT INTO post (title, id) values (?, ?)"], 
Params:[
    (Post 26, 26), (Post 27, 27), (Post 28, 28), 
    (Post 29, 29), (Post 30, 30), (Post 31, 31), 
    (Post 32, 32), (Post 33, 33), (Post 34, 34), 
    (Post 35, 35), (Post 36, 36), (Post 37, 37), 
    (Post 38, 38), (Post 39, 39), (Post 40, 40), 
    (Post 41, 41), (Post 42, 42), (Post 43, 43), 
    (Post 44, 44), (Post 45, 45), (Post 46, 46), 
    (Post 47, 47), (Post 48, 48), (Post 49, 49), 
    (Post 50, 50)
]

Brilliant!

If you enjoyed this article, I bet you are going to love my book as well.

Conclusion

Knowing how to properly design a batch processing job is very important when developing an enterprise application. Luckily, with JPA and Hibernate, this task is very easy to implement. The most important aspect is to take into consideration how a relational database works best, and that’s what should drive your data access design decisions.



10 thoughts on “The best way to do batch processing with JPA and Hibernate”

  1. Pretty good article, Vlad.

    I think we may improve performance by using StatelessSession instead of EntityManager so that we can avoid issues related to persistence context and releasing resources.

    What do you think? Do you still see any advantage of using StatelessSession nowadays?

      1. What kind of limitations do you worry about?

Some of its limitations exist to improve performance in batch processing, so I think it’s fair enough in many situations. But I’d like to hear your opinion, Vlad.

        Thanks!

      2. The lack of the write-behind cache is a sufficient reason to avoid it. The write-behind strategy allows you to better use statement batching and delay the database connection acquisition.

      3. Yeah, unfortunately you’re right.

I thought StatelessSession respected the hibernate.jdbc.batch_size parameter, but it doesn't. That doesn't make sense, since this API was designed for batch processing. That's a shame! This Hibernate issue describes the problem in some detail: https://hibernate.atlassian.net/browse/HHH-4042

        So I agree with you, favoring StatelessSession over statement batching does not make any sense.
