The best way to do batch processing with JPA and Hibernate
Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?
Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.
So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!
Introduction
Recently, one of my followers asked me to answer a question on Quora about batch processing, and, since the question was really interesting, I decided to turn it into a blog post.
In this article, you are going to find out what batch processing is, why do we use it, and how to use it properly with JPA and Hibernate.
Batch processing
When writing an enterprise application, it’s common to split your work between the front-end system which serves a typical OLTP (Online Transaction Processing) traffic, and one or more batch processors which are used for ETL (Extract, Transform, Load) purposes.
The batch processor, as its name implies, splits the data to be processed into several chunks, therefore offering the following advantages:
- each chunk can be processed by a separate working thread, therefore increasing throughput and reducing the overall processing time
- each chunk can use its own database transaction, so, if there is a failure, we don’t have to throw away all the work we have done, just the current transaction changes
JPA batch processing
When using JPA, assuming you want to insert 50 Post
entities, this is how you should do it:
int entityCount = 50; int batchSize = 25; EntityManager entityManager = entityManagerFactory() .createEntityManager(); EntityTransaction entityTransaction = entityManager .getTransaction(); try { entityTransaction.begin(); for (int i = 0; i < entityCount; i++) { if (i > 0 && i % batchSize == 0) { entityTransaction.commit(); entityTransaction.begin(); entityManager.clear(); } Post post = new Post( String.format("Post %d", i + 1) ); entityManager.persist(post); } entityTransaction.commit(); } catch (RuntimeException e) { if (entityTransaction.isActive()) { entityTransaction.rollback(); } throw e; } finally { entityManager.close(); }
A transaction is started from the very beginning since every entity state transition must execute within the scope of a database transaction.
The for loop persists one Post
at a time. However, since entity state transitions are only executed at flush time, we can group several SQL INSERT statements into a single PreparedStatement
execution that takes multiple parameter sets.
Every time iteration counter (e.g. i
) has reached a multiple of the batchSize
threshold value, we can flush the EntityManager
and commit the database transaction. By committing the database transaction after every batch execution, we gain the following advantages:
- We avoid long-running transactions which are detrimental to MVCC relational database systems.
- We make sure that if we have a failure, we don’t lose the work done by the batch jobs that have previously executed successfully.
The EntityManager
is cleared after every batch execution so that we don’t keep on accumulating managed entities which can cause several problems:
- If the number of entities to be persisted is huge, we risk of running out of memory.
- The more entities we accumulate in the Persistence Context, the slower the flush becomes. So, it’s good practice to make sure the Persistence Context as slim as possible.
If an exception is thrown, we must make sure to roll back the current running database transaction. Failing to do so, can cause many problems since the database might still think that the transaction is open and locks might be held until the transaction is ended by a timeout or by the DBA.
In the end, we need to close the EntityManager
so that we can clear the context and deallocate Session
-level resources.
While this is the proper way to do batch processing with JPA, we are not done yet. As previously explained, we can also benefit from JDBC batch updates. To do so, we need to provide the following Hibernate configuration properties:
<property name="hibernate.jdbc.batch_size" value="25" /> <property name="hibernate.order_inserts" value="true" /> <property name="hibernate.order_updates" value="true" />
These properties allow us to batch multiple SQL statements into a single PreparedStatement
execution which requires a single database roundtrip. The value of 25 was chosen to match the EntityManager
batch job threshold.
When executing the previous test case, Hibernate generates just 2 SQL INSERT statements:
INSERT INTO post (title, id) values (?, ?)"], Params:[ (Post 1, 1), (Post 2, 2), (Post 3, 3), (Post 4, 4), (Post 5, 5), (Post 6, 6), (Post 7, 7), (Post 8, 8), (Post 9, 9), (Post 10, 10), (Post 11, 11), (Post 12, 12), (Post 13, 13), (Post 14, 14), (Post 15, 15), (Post 16, 16), (Post 17, 17), (Post 18, 18), (Post 19, 19), (Post 20, 20), (Post 21, 21), (Post 22, 22), (Post 23, 23), (Post 24, 24), (Post 25, 25) ] INSERT INTO post (title, id) values (?, ?)"], Params:[ (Post 26, 26), (Post 27, 27), (Post 28, 28), (Post 29, 29), (Post 30, 30), (Post 31, 31), (Post 32, 32), (Post 33, 33), (Post 34, 34), (Post 35, 35), (Post 36, 36), (Post 37, 37), (Post 38, 38), (Post 39, 39), (Post 40, 40), (Post 41, 41), (Post 42, 42), (Post 43, 43), (Post 44, 44), (Post 45, 45), (Post 46, 46), (Post 47, 47), (Post 48, 48), (Post 49, 49), (Post 50, 50) ]
Brilliant!
I'm running an online workshop on the 11th of October about High-Performance SQL.If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.
Conclusion
Knowing how to properly design a batch processing job is very important when developing an enterprise application. Luckily, with JPA and Hibernate, this task is very easy to implement. The most important aspect is to take into consideration how a relational database works best, and that’s what should drive your data access design decisions.
