A beginner’s guide to Cache synchronization strategies

Introduction

A system of record is the authoritative data source when information is scattered among various data providers. When we introduce a caching solution, we automatically duplicate our data. To avoid inconsistent reads and data integrity issues, it's very important to synchronize the database and the cache whenever a change occurs in the system.

There are various ways to keep the cache and the underlying database in sync, and this article will present some of the most common cache synchronization strategies.

Cache-aside

The application code can manage both the database and the cache manually. The application logic inspects the cache before hitting the database, and it updates the cache after any database modification.

[Figure: Cache-aside synchronization]

Mixing cache management with application logic is not very appealing, especially if we have to repeat these steps in every data retrieval method. Leveraging an Aspect-Oriented caching interceptor can mitigate the caching logic leaking into the application code, but it doesn't exonerate us from making sure that both the database and the cache are properly synchronized.
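
As a sketch of the manual cache-aside flow described above (PostRepository, PostDao, and the in-memory map are illustrative stand-ins, not part of any particular cache provider's API):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PostRepository {

    private final ConcurrentMap<Long, Post> cache = 
        new ConcurrentHashMap<>();
    private final PostDao postDao;

    public PostRepository(PostDao postDao) {
        this.postDao = postDao;
    }

    public Post findById(Long id) {
        //1. inspect the cache before hitting the database
        Post post = cache.get(id);
        if (post == null) {
            //2. on a cache miss, load from the database
            post = postDao.findById(id);
            if (post != null) {
                //3. populate the cache with the database result
                cache.put(id, post);
            }
        }
        return post;
    }

    public void update(Post post) {
        //4. after any database modification, synchronize the cache
        postDao.update(post);
        cache.put(post.getId(), post);
    }
}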

Read-through

Instead of managing both the database and the cache, we can simply delegate the database synchronization to the cache provider. All data interaction is therefore done through the cache abstraction layer.

[Figure: Read-through caching]

When a cache entry is requested, the cache checks the cached element's availability and, on a miss, loads the underlying resource on our behalf. The application uses the cache as the system of record, and the cache is able to auto-populate on demand.
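
Most cache providers expose this mechanism through a loader callback. A minimal sketch, assuming a generic loader function rather than any particular provider's API:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

public class ReadThroughCache<K, V> {

    private final ConcurrentMap<K, V> store = 
        new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    public ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    public V get(K key) {
        //on a cache miss, the registered loader fetches
        //the missing entry from the underlying database
        return store.computeIfAbsent(key, loader);
    }
}

The application would then interact only with the cache, e.g. new ReadThroughCache<Long, Post>(postDao::findById), the database loading logic being encapsulated in the loader.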

Write-through

Analogous to the read-through data fetching strategy, the cache can update the underlying database every time a cache entry is changed.

[Figure: Write-through caching]

Although the database and the cache are updated synchronously, we have the liberty of choosing the transaction boundaries according to our current business requirements.

  • If strong consistency is mandatory and the cache provider offers an XAResource, we can enlist the cache and the database in the same global transaction. The database and the cache are therefore updated in a single atomic unit of work
  • If consistency can be weakened, we can update the cache and the database sequentially, without using a global transaction. Usually the cache is changed first, and if the database update fails, the cache can use a compensating action to roll back the current transaction changes

Write-behind

If strong consistency is not mandated, we can simply enqueue the cache changes and periodically flush them to the database.

[Figure: Write-behind caching]

This strategy is employed by the Java Persistence EntityManager (first-level cache): all entity state transitions are flushed towards the end of the currently running transaction (or when a query is issued).

Although it breaks transaction guarantees, the write-behind caching strategy can outperform the write-through policy, because database updates can be batched and the number of database transactions is also reduced.
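
A minimal sketch of the mechanism, where the queue, the flush interval, and the PostDao batch-update method are illustrative assumptions:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class WriteBehindPostCache {

    private final BlockingQueue<Post> pendingWrites = 
        new LinkedBlockingQueue<>();
    private final ScheduledExecutorService scheduler = 
        Executors.newSingleThreadScheduledExecutor();
    private final PostDao postDao;

    public WriteBehindPostCache(PostDao postDao) {
        this.postDao = postDao;
        //periodically flush the enqueued changes to the database
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.SECONDS);
    }

    public void put(Post post) {
        //the change is only enqueued, not written synchronously
        pendingWrites.offer(post);
    }

    private void flush() {
        List<Post> batch = new ArrayList<>();
        pendingWrites.drainTo(batch);
        if (!batch.isEmpty()) {
            //one batched database update instead of many individual ones
            postDao.updateAll(batch);
        }
    }
}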

Things to consider before jumping to enterprise caching

Introduction

Relational database transactions are ACID, and the strong consistency model simplifies application development. Because enabling Hibernate caching is just one configuration away, it's very appealing to turn to caching whenever the data access layer starts showing performance issues. Adding a caching layer can indeed improve application performance, but it has its price and you need to be aware of it.

Database performance tuning

The database is the central part of any enterprise application, containing valuable business assets. A database server has limited resources, and it can therefore serve only a finite number of connections. The shorter the database transactions, the more transactions can be accommodated. The first performance tuning action is to reduce query execution times by indexing properly and optimizing queries.

When all queries and statements are optimized, we can either add more resources (scale up) or add more database nodes (scale out). Horizontal scaling requires database replication, which implies synchronizing nodes. Synchronous replication preserves strong consistency, while asynchronous master-slave replication leads to eventual consistency.

Analogous to database replication challenges, cache nodes induce data synchronization issues, especially for distributed enterprise applications.

Caching

Even if the database access patterns are properly optimized, higher loads might increase latency. To provide predictable and constant response times, we need to turn to caching. Caching allows us to reuse a database response for multiple user requests.

The cache can therefore:

  • reduce CPU/Memory/IO resource consumption on the database side
  • reduce network traffic between application nodes and the database tier
  • provide constant result fetch time, insensitive to traffic bursts
  • provide a read-only view when the application is in maintenance mode (e.g. when upgrading the database schema)

The downside of introducing a caching solution is that data is duplicated in two separate technologies that may easily desynchronise.

In the simplest use case you have one database server and one cache node:

[Figure: Single cache node]

The caching abstraction layer is aware of the database server, but the database knows nothing of the application-level cache. If some external process updates the database without touching the cache, the two data sources will get out of sync. Because few database servers support application-level notifications, the cache may break the strong consistency guarantees.

To avoid eventual consistency, both the database and the cache need to be enrolled in a distributed XA transaction, so the affected cache entries are either updated or invalidated synchronously.

Most often, there are more application nodes or multiple distinct applications (web-fronts, batch processors, schedulers) comprising the whole enterprise system:

[Figure: Multiple cache nodes]

If each node has its own isolated cache node, we need to be aware of possible data synchronisation issues. If one node updates the database and its own cache without notifying the rest, then other cache nodes get out of sync.

In a distributed environment, when multiple applications or application nodes use caching, we need a distributed caching solution, in which either:

  • cache nodes communicate in a peer-to-peer topology, or
  • cache nodes communicate in a client-server topology, and a central cache server takes care of data synchronization

[Figure: Distributed cache nodes]

Conclusion

Caching is a fine scaling technique but you have to be aware of possible consistency issues. Taking into consideration your current project data integrity requirements, you need to design your application to take advantage of caching without compromising critical data.

Caching is not a mere cross-cutting concern; it leaks into your application architecture and requires a well-thought-out plan for compensating data integrity anomalies.

How does Hibernate store second-level cache entries

Introduction

The benefit of using a database access abstraction layer is that caching can be implemented transparently, without leaking into the business logic code. Hibernate Persistence Context acts as a transactional write-behind cache, translating entity state transitions into DML statements.

The Persistence Context acts as a logical transaction storage, and each Entity instance can have at most one managed reference. No matter how many times we try to load the same Entity, the Hibernate Session will always return the same object reference. This behavior is generally known as the first-level cache.
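
This repeatable-read behavior is easy to verify; a minimal sketch, assuming an open Session and a Post entity with the identifier 1 (as in the upcoming examples):

Post post1 = (Post) session.get(Post.class, 1L);
Post post2 = (Post) session.get(Post.class, 1L);
//the second get() doesn't hit the database; it resolves
//the same managed instance from the first-level cache
assertSame(post1, post2);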

The Hibernate Persistence Context is not a caching solution per se, as its purpose is not to boost application read operation performance. Because the Hibernate Session is bound to the currently running logical transaction, the Session is destroyed once the transaction is over.

The second-level cache

A proper caching solution would have to span multiple Hibernate Sessions, and that's why Hibernate also supports a second-level cache. The second-level cache is bound to the SessionFactory life-cycle, so it's destroyed only when the SessionFactory is closed (typically when the application is shutting down). The second-level cache is primarily entity-oriented, although it supports an optional query-caching solution as well.

By default, the second-level cache is disabled and to activate it, we have to set the following Hibernate properties:

properties.put("hibernate.cache.use_second_level_cache", 
    Boolean.TRUE.toString());
properties.put("hibernate.cache.region.factory_class", 
    "org.hibernate.cache.ehcache.EhCacheRegionFactory");

The RegionFactory defines the second-level cache implementation provider, and the hibernate.cache.region.factory_class configuration is mandatory, once the hibernate.cache.use_second_level_cache property is set to true.

To enable entity-level caching, we need to annotate our cacheable entities as follows:

@Entity
@org.hibernate.annotations.Cache(usage = 
    CacheConcurrencyStrategy.NONSTRICT_READ_WRITE)

JPA also defines the @Cacheable annotation, but it doesn't support setting the concurrency strategy at the entity level.
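
For comparison, the equivalent JPA-only mapping, which leaves the concurrency strategy to the provider's defaults, would look like this:

@Entity
@javax.persistence.Cacheable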

The entity loading flow

Whenever an entity is to be loaded, a LoadEvent is fired and the DefaultLoadEventListener handles it as follows:

Object entity = loadFromSessionCache( event, 
    keyToLoad, options );
if ( entity == REMOVED_ENTITY_MARKER ) {
    LOG.debug( "Load request found matching entity " +
        "in context, but it is scheduled for removal; " +
        "returning null" );
    return null;
}
if ( entity == INCONSISTENT_RTN_CLASS_MARKER ) {
    LOG.debug( "Load request found matching entity " +
        "in context, but the matched entity was of " +
        "an inconsistent return type; " +
        "returning null" );
    return null;
}
if ( entity != null ) {
    if ( traceEnabled ) {
        LOG.tracev( "Resolved object in " +
            "session cache: {0}",
            MessageHelper.infoString( persister,
                event.getEntityId(),
                event.getSession().getFactory() )
        );
    }
    return entity;
}

entity = loadFromSecondLevelCache( event, 
    persister, options );
if ( entity != null ) {
    if ( traceEnabled ) {
        LOG.tracev( "Resolved object in " +
            "second-level cache: {0}",
            MessageHelper.infoString( persister,
                event.getEntityId(),
                event.getSession().getFactory() )
        );
    }
}
else {
    if ( traceEnabled ) {
        LOG.tracev( "Object not resolved in " +
            "any cache: {0}",
            MessageHelper.infoString( persister,
                event.getEntityId(),
                event.getSession().getFactory() )
        );
    }
    entity = loadFromDatasource( event, persister, 
        keyToLoad, options );
}

The Session is always inspected first, because it might already contain a managed entity instance. The second-level cache is verified before hitting the database, so its main purpose is to reduce the number of database accesses.

Second-level cache internals

Every entity is stored as a CacheEntry, and the entity hydrated state is used for creating the cache entry value.

Hydration

In Hibernate nomenclature, hydration is when a JDBC ResultSet is transformed to an array of raw values:

final Object[] values = persister.hydrate(
    rs, id, object, 
    rootPersister, cols, eagerPropertyFetch, session
);

The hydrated state is saved in the currently running Persistence Context as an EntityEntry object, which encapsulates the loading-time entity snapshot. The hydrated state is then used by:

  • the default dirty checking mechanism, which compares the current entity data against the loading-time snapshot
  • the second-level cache, whose cache entries are built from the loading-time entity snapshot

The inverse operation is called dehydration and it copies the entity state into an INSERT or UPDATE statement.

The second-level cache elements

Although Hibernate allows us to manipulate entity graphs, the second-level cache uses a disassembled hydrated state instead:

final CacheEntry entry = persister.buildCacheEntry( 
    entity, hydratedState, version, session );

The hydrated state is disassembled prior to being stored in the CacheEntry:

this.disassembledState = TypeHelper.disassemble(
    state, persister.getPropertyTypes(),
    persister.isLazyPropertiesCacheable() 
        ? null : persister.getPropertyLaziness(),
    session, owner
);

Starting from the following entity model diagram:

[Figure: Post, PostDetails, and Comment entity model]

We’ll insert the following entities:

Post post = new Post();
post.setName("Hibernate Master Class");

post.addDetails(new PostDetails());
post.addComment(new Comment("Good post!"));
post.addComment(new Comment("Nice post!"));

session.persist(post);

Now, we are going to check each individual entity cache element.

The Post entity cache element

The Post entity has a one-to-many association to the Comment entity and an inverse one-to-one association to a PostDetails:

@OneToMany(cascade = CascadeType.ALL, 
    mappedBy = "post")
private List<Comment> comments = new ArrayList<>();

@OneToOne(cascade = CascadeType.ALL, 
    mappedBy = "post", optional = true)
private PostDetails details;

When fetching a Post entity:

Post post = (Post) session.get(Post.class, 1L);

The associated cache element looks like this:

key = {org.hibernate.cache.spi.CacheKey@3855}
    key = {java.lang.Long@3860} "1"
    type = {org.hibernate.type.LongType@3861} 
    entityOrRoleName = {java.lang.String@3862} "com.vladmihalcea.hibernate.masterclass.laboratory.cache.SecondLevelCacheTest$Post"
    tenantId = null
    hashCode = 31
value = {org.hibernate.cache.spi.entry.StandardCacheEntryImpl@3856}
    disassembledState = {java.io.Serializable[3]@3864} 
        0 = {java.lang.Long@3860} "1"
        1 = {java.lang.String@3865} "Hibernate Master Class"
    subclass = {java.lang.String@3862} "com.vladmihalcea.hibernate.masterclass.laboratory.cache.SecondLevelCacheTest$Post"
    lazyPropertiesAreUnfetched = false
    version = null

The CacheKey contains the entity identifier and the CacheEntry contains the entity disassembled hydrated state.

The Post entry cache value consists of the name column and the id, which is set by the one-to-many Comment association.

Neither the one-to-many nor the inverse one-to-one associations are embedded in the Post CacheEntry.

The PostDetails entity cache element

The PostDetails entity Primary Key references the associated Post entity Primary Key, so it has a one-to-one association with the Post entity.

@OneToOne
@JoinColumn(name = "id")
@MapsId
private Post post;

When fetching a PostDetails entity:

PostDetails postDetails = 
    (PostDetails) session.get(PostDetails.class, 1L);

The second-level cache generates the following cache element:

key = {org.hibernate.cache.spi.CacheKey@3927}
    key = {java.lang.Long@3897} "1"
    type = {org.hibernate.type.LongType@3898} 
    entityOrRoleName = {java.lang.String@3932} "com.vladmihalcea.hibernate.masterclass.laboratory.cache.SecondLevelCacheTest$PostDetails"
    tenantId = null
    hashCode = 31
value = {org.hibernate.cache.spi.entry.StandardCacheEntryImpl@3928}
    disassembledState = {java.io.Serializable[2]@3933} 
        0 = {java.sql.Timestamp@3935} "2015-04-06 15:36:13.626"
    subclass = {java.lang.String@3932} "com.vladmihalcea.hibernate.masterclass.laboratory.cache.SecondLevelCacheTest$PostDetails"
    lazyPropertiesAreUnfetched = false
    version = null

The disassembled state contains only the createdOn entity property, since the entity identifier is embedded in the CacheKey.

The Comment entity cache element

The Comment entity has a many-to-one association to a Post:

@ManyToOne
private Post post;

When we fetch a Comment entity:

Comment comment = 
    (Comment) session.get(Comment.class, 1L);

Hibernate generates the following second-level cache element:

key = {org.hibernate.cache.spi.CacheKey@3857}
    key = {java.lang.Long@3864} "2"
    type = {org.hibernate.type.LongType@3865} 
    entityOrRoleName = {java.lang.String@3863} "com.vladmihalcea.hibernate.masterclass.laboratory.cache.SecondLevelCacheTest$Comment"
    tenantId = null
    hashCode = 62
value = {org.hibernate.cache.spi.entry.StandardCacheEntryImpl@3858}
    disassembledState = {java.io.Serializable[2]@3862} 
        0 = {java.lang.Long@3867} "1"
        1 = {java.lang.String@3868} "Good post!"
    subclass = {java.lang.String@3863} "com.vladmihalcea.hibernate.masterclass.laboratory.cache.SecondLevelCacheTest$Comment"
    lazyPropertiesAreUnfetched = false
    version = null

The disassembled state contains the Post.id Foreign Key reference and the review column, therefore mirroring the associated database table definition.

Conclusion

The second-level cache is a relational data cache, so it stores data in a normalized form and each entity update affects only one cache entry. Reading a whole entity graph is not possible, since the entity associations are not materialized in the second-level cache entries.

An aggregated entity graph yields better performance for read operations at the cost of complicating write operations. If the cached data is not normalized and scattered across various aggregated models, an entity update would have to modify multiple cache entries, therefore affecting the write operations performance.

Because it mirrors the underlying relational data, the second-level cache offers various concurrency strategy mechanisms, so we can balance read performance and strong consistency guarantees.

Code available on GitHub.

SELECT statements batch fetching with JDBC and Hibernate

Introduction

Now that I covered Hibernate batch support for INSERT, UPDATE and DELETE statements, it’s time to analyze SELECT statements result set batch fetching.

JDBC ResultSet fetching

The JDBC ResultSet acts as a client-side proxy cursor for fetching the current statement's returned data. When the statement is executed, the result must be transferred from the database cursor to the client-side one. This operation can either be done at once or on demand.

There are three types of ResultSet cursors:

  • TYPE_FORWARD_ONLY – the default ResultSet cursor type. The result set can only be moved forward, and the resulting data can either be fetched at once or retrieved while the cursor is being iterated. The database can decide to fetch the data as it was available when the query started or as it is upon fetching
  • TYPE_SCROLL_INSENSITIVE – the result set can be scrolled both forward and backward, and the resulting data is insensitive to concurrent changes occurring while the cursor is still open
  • TYPE_SCROLL_SENSITIVE – the result set can be scrolled both forward and backward, and the resulting data is sensitive to concurrent changes occurring while the cursor is still open. The data is therefore fetched on demand, as opposed to being retrieved from a database cursor cache

Not all database drivers implement all cursor types, and the batch fetching behavior is controlled through the JDBC Statement fetchSize property, which, according to its Javadoc:

Gives the JDBC driver a hint as to the number of rows that should be fetched from the database when more rows are needed for ResultSet objects generated by this Statement. If the value specified is zero, then the hint is ignored. The default value is zero.

The default fetching strategy is therefore database specific and from the application performance point of view, this aspect is very important when tuning the data access layer:

  • Oracle

    By default, when Oracle JDBC runs a query, it retrieves a result set of 10 rows at a time from the database cursor.

  • MySQL

    By default, ResultSets are completely retrieved and stored in memory. In most cases this is the most efficient way to operate and, due to the design of the MySQL network protocol, is easier to implement.

  • SQL Server

    Normally, when the Microsoft JDBC Driver for SQL Server executes a query, the driver retrieves all of the results from the server into application memory. Although this approach minimizes resource consumption on the SQL Server, it can throw an OutOfMemoryError in the JDBC application for the queries that produce very large results.

  • PostgreSQL

    By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.

  • DB2

    By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows. The fetchSize property differs from the queryDataSize property. fetchSize affects the number of rows that are returned, and queryDataSize affects the number of bytes that are returned.

    For example, if the result set size is 50 KB, and the value of queryDataSize is 32767 (32KB), two trips to the database server are required to retrieve the result set. However, if queryDataSize is set to 65535 (64 KB), only one trip to the data source is required to retrieve the result set.
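
Whatever the driver default may be, the hint can be overridden on a per-statement basis. A plain JDBC sketch, where the connection variable, the post table, and the process method are assumptions for illustration:

try (PreparedStatement statement = connection.prepareStatement(
        "SELECT id, title FROM post")) {
    //hint the driver to retrieve 100 rows per database round trip
    statement.setFetchSize(100);
    try (ResultSet resultSet = statement.executeQuery()) {
        while (resultSet.next()) {
            //process each row as it arrives
            process(resultSet.getLong(1), resultSet.getString(2));
        }
    }
}

Note that some drivers honor the hint only under certain conditions; the PostgreSQL driver, for instance, requires auto-commit to be disabled before it switches to cursor-based fetching.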

Hibernate ResultSet traversing options

The Java Persistence Query interface offers only full-result retrievals, through the Query.getResultList() method call.

Hibernate also supports scrollable ResultSet cursors through its specific Query.scroll() API.

The only apparent advantage of scrollable ResultSets is that we can avoid memory issues on the client-side, since data is being fetched on demand. This might sound like a natural choice, but in reality, you shouldn’t fetch large result sets for the following reasons:

  • Large result sets consume significant database server resources, and because a database is a highly concurrent environment, they might hinder availability and scalability
  • Tables tend to grow in size and a moderate result set might easily turn into a very large one. This kind of situation happens in production systems, long after the application code was shipped. Because users can only browse a relatively small portion of the whole result set, pagination is a more scalable data fetching alternative
  • The overly common offset paging is not suitable for large result sets (because the response time increases linearly with the page number) and you should consider keyset pagination when traversing large result sets. The keyset pagination offers a constant response time insensitive to the relative position of the page being fetched
  • Even for batch processing jobs, it’s always safer to restrict processing items to a moderate batch size. Large batches can lead to memory issues or cause long-running transactions, which increase the undo/redo transaction log size

Testing time

Our domain entity model looks like this:

[Figure: Post and Comment entity model]

The following test will be used for validating various result set fetching behaviors:

@Test
public void testFetchSize() {
    doInTransaction(session -> {
        int batchSize = batchSize();
        for(int i = 0; i < itemsCount(); i++) {
            Post post = new Post(String.format(
                "Post no. %d", i));
            int j = 0;
            post.addComment(new Comment(
                    String.format(
                "Post comment %d:%d", i, j++)));
            post.addComment(new Comment(
                    String.format(
                "Post comment %d:%d", i, j++)));
            session.persist(post);
            if(i % batchSize == 0 && i > 0) {
                session.flush();
                session.clear();
            }
        }
    });

    long startNanos = System.nanoTime();
    LOGGER.info("Test fetch size");
    doInTransaction(session -> {
        List<Post> posts = session.createQuery(
            "select p " +
            "from Post p " +
            "join fetch p.comments ")
        .list();
        LOGGER.info("{}.fetched {} entities",
            getClass().getSimpleName(),
            posts.size());

    });
    LOGGER.info("{}.testFetch took {} millis",
        getClass().getSimpleName(),
        TimeUnit.NANOSECONDS.toMillis(
            System.nanoTime() - startNanos
    ));
}

To configure Hibernate to use an explicit Statement fetchSize, we need to set the following Hibernate property:

properties.put("hibernate.jdbc.fetch_size", fetchSize());

Every test will insert 5000 Post entities, each one having 2 Comments.

Oracle

The first tests are run against Oracle XE 11.2.0.4 with the following results:

Fetch Size   | Duration [millis]
1            | 1190
10           | 640
100          | 481
1000         | 459
10000        | 449
Default (10) | 545

The larger the fetch size, the fewer round trips are required for fetching the whole result set. If the returned rows contain many columns, a larger fetch size would require proportionally larger database buffers. According to the Oracle JDBC Driver documentation:

What is “reasonable” depends on the details of the application. Oracle suggests the fetchSize be no more than 100, although in some cases larger size may be appropriate. A fetchSize of 100 may be inappropriately large for some queries even when many rows are returned.

This guideline is validated by our test result, as response time doesn’t improve significantly for fetch sizes larger than 100.

PostgreSQL

The second test round is run against PostgreSQL 9.4 with the following results:

Fetch Size    | Duration [millis]
1             | 1181
10            | 572
100           | 485
1000          | 458
10000         | 437
Default (all) | 396

The default fetch size yields the best result, even when the fetchSize is equal to the total number of rows being returned. Since there is no upper-bound buffer limit, the default fetch size can cause OutOfMemoryError issues when retrieving large result sets.

Conclusion

While most database servers don't impose a default upper limit on the result set fetch size, it's good practice to limit the result set itself (if requirements allow it). A limited-size result set addresses the unbounded fetch size shortcoming, while ensuring predictable response times even when the queried data grows gradually. The shorter the queries, the quicker the row-level locks are released and the more scalable the data access layer becomes.

Code available on GitHub.

How to batch DELETE statements with Hibernate

Introduction

In my previous post, I explained the Hibernate configurations required for batching INSERT and UPDATE statements. This post will continue this topic with DELETE statements batching.

Domain model entities

We’ll start with the following entity model:

[Figure: Post, PostDetails, and Comment entity model]

The Post entity has a one-to-many association to a Comment and a one-to-one relationship with the PostDetails entity:

@OneToMany(cascade = CascadeType.ALL, mappedBy = "post",
        orphanRemoval = true)
private List<Comment> comments = new ArrayList<>();

@OneToOne(cascade = CascadeType.ALL, mappedBy = "post",
        orphanRemoval = true, fetch = FetchType.LAZY)
private PostDetails details;

The upcoming tests will be run against the following data:

doInTransaction(session -> {
    int batchSize = batchSize();
    for(int i = 0; i < itemsCount(); i++) {
        int j = 0;

        Post post = new Post(String.format(
            "Post no. %d", i));        
        post.addComment(new Comment( String.format(
            "Post comment %d:%d", i, j++)));
        post.addComment(new Comment(String.format(
            "Post comment %d:%d", i, j++)));
        post.addDetails(new PostDetails());

        session.persist(post);
        if(i % batchSize == 0 && i > 0) {
            session.flush();
            session.clear();
        }
    }
});

Hibernate Configuration

As already explained, the following properties are required for batching INSERT and UPDATE statements:

properties.put("hibernate.jdbc.batch_size", 
    String.valueOf(batchSize()));
properties.put("hibernate.order_inserts", 
    "true");
properties.put("hibernate.order_updates", 
    "true");
properties.put("hibernate.jdbc.batch_versioned_data", 
    "true");

Next, we are going to check if DELETE statements are batched as well.

JPA Cascade Delete

Because cascading entity state transitions is convenient, I’m going to prove that CascadeType.DELETE and JDBC batching don’t mix well.

The following test is going to:

  • Select some Posts along with Comments and PostDetails
  • Delete the Posts, while propagating the delete event to Comments and PostDetails as well

@Test
public void testCascadeDelete() {
    LOGGER.info("Test batch delete with cascade");
    final AtomicReference<Long> startNanos = 
        new AtomicReference<>();
    addDeleteBatchingRows();
    doInTransaction(session -> {
        List<Post> posts = session.createQuery(
            "select distinct p " +
            "from Post p " +
            "join fetch p.details d " +
            "join fetch p.comments c")
        .list();
        startNanos.set(System.nanoTime());
        for (Post post : posts) {
            session.delete(post);
        }
    });
    LOGGER.info("{}.testCascadeDelete took {} millis",
        getClass().getSimpleName(),
        TimeUnit.NANOSECONDS.toMillis(
            System.nanoTime() - startNanos.get()
    ));
}

Running this test gives the following output:

Query:{[delete from Comment where id=? and version=?][55,0]} {[delete from Comment where id=? and version=?][56,0]} 
Query:{[delete from PostDetails where id=?][3]} 
Query:{[delete from Post where id=? and version=?][3,0]} 
Query:{[delete from Comment where id=? and version=?][54,0]} {[delete from Comment where id=? and version=?][53,0]} 
Query:{[delete from PostDetails where id=?][2]} 
Query:{[delete from Post where id=? and version=?][2,0]} 
Query:{[delete from Comment where id=? and version=?][52,0]} {[delete from Comment where id=? and version=?][51,0]} 
Query:{[delete from PostDetails where id=?][1]} 
Query:{[delete from Post where id=? and version=?][1,0]}

Only the Comment DELETE statements were batched; the other entities were deleted in separate database round trips.

The reason for this behaviour is given by the ActionQueue sorting implementation:

if ( session.getFactory().getSettings().isOrderUpdatesEnabled() ) {
    // sort the updates by pk
    updates.sort();
}
if ( session.getFactory().getSettings().isOrderInsertsEnabled() ) {
    insertions.sort();
}

While INSERT and UPDATE statements are covered, DELETE statements are not sorted at all. A JDBC batch can only be reused when all statements belong to the same database table. When an incoming statement targets a different database table, the current batch has to be released, so that the new batch matches the incoming statement's database table:

public Batch getBatch(BatchKey key) {
    if ( currentBatch != null ) {
        if ( currentBatch.getKey().equals( key ) ) {
            return currentBatch;
        }
        else {
            currentBatch.execute();
            currentBatch.release();
        }
    }
    currentBatch = batchBuilder().buildBatch(key, this);
    return currentBatch;
}

Orphan removal and manual flushing

A workaround is to dissociate all Child entities and manually flush the Hibernate Session before advancing to a new Child association:

@Test
public void testOrphanRemoval() {
    LOGGER.info("Test batch delete with orphan removal");
    final AtomicReference<Long> startNanos = 
        new AtomicReference<>();

    addDeleteBatchingRows();

    doInTransaction(session -> {
        List<Post> posts = session.createQuery(
            "select distinct p " +
            "from Post p " +
            "join fetch p.details d " +
            "join fetch p.comments c")
        .list();

        startNanos.set(System.nanoTime());

        posts.forEach(Post::removeDetails);
        session.flush();

        posts.forEach(post -> {
            for (Iterator<Comment> commentIterator = 
                     post.getComments().iterator(); 
                     commentIterator.hasNext(); ) {
                Comment comment =  commentIterator.next();
                comment.post = null;
                commentIterator.remove();
            }
        });
        session.flush();

        posts.forEach(session::delete);
    });
    LOGGER.info("{}.testOrphanRemoval took {} millis",
        getClass().getSimpleName(),
        TimeUnit.NANOSECONDS.toMillis(
            System.nanoTime() - startNanos.get()
    ));
}

This time all DELETE statements are properly batched:

Query:{[delete from PostDetails where id=?][2]} {[delete from PostDetails where id=?][3]} {[delete from PostDetails where id=?][1]} 
Query:{[delete from Comment where id=? and version=?][53,0]} {[delete from Comment where id=? and version=?][54,0]} {[delete from Comment where id=? and version=?][56,0]} {[delete from Comment where id=? and version=?][55,0]} {[delete from Comment where id=? and version=?][52,0]} {[delete from Comment where id=? and version=?][51,0]}
Query:{[delete from Post where id=? and version=?][2,0]} {[delete from Post where id=? and version=?][3,0]} {[delete from Post where id=? and version=?][1,0]} 

SQL Cascade Delete

A better solution is to use SQL cascade deletion instead of the JPA entity state propagation mechanism. This way, we can also reduce the DML statement count. Because the Hibernate Session acts as a transactional write-behind cache, we must be extra cautious when mixing entity state transitions with database-side automatic actions, as the Persistence Context might not reflect the latest database changes.

The Post entity one-to-many Comment association is marked with the Hibernate specific @OnDelete annotation, so that the auto-generated database schema includes the ON DELETE CASCADE directive:

@OneToMany(cascade = {
       CascadeType.PERSIST, 
       CascadeType.MERGE}, 
    mappedBy = "post")
@OnDelete(action = OnDeleteAction.CASCADE)
private List<Comment> comments = new ArrayList<>();

Generating the following DDL:

alter table Comment add constraint 
FK_apirq8ka64iidc18f3k6x5tc5 foreign key (post_id) 
references Post on delete cascade

The same is done with the PostDetails entity one-to-one Post association:

@OneToOne(fetch = FetchType.LAZY)
@JoinColumn(name = "id")
@MapsId
@OnDelete(action = OnDeleteAction.CASCADE)
private Post post;

And the associated DDL:

alter table PostDetails add constraint 
FK_h14un5v94coafqonc6medfpv8 foreign key (id) 
references Post on delete cascade

The CascadeType.ALL and orphanRemoval were replaced with CascadeType.PERSIST and CascadeType.MERGE, because we no longer want Hibernate to propagate the entity removal event.

The test only deletes the Post entities:

doInTransaction(session -> {
    List<Post> posts = session.createQuery(
        "select p from Post p")
    .list();
    startNanos.set(System.nanoTime());
    for (Post post : posts) {
        session.delete(post);
    }
});

The DELETE statements are properly batched, as there's only one target table:

Query:{[delete from Post where id=? and version=?][1,0]} {[delete from Post where id=? and version=?][2,0]} {[delete from Post where id=? and version=?][3,0]}

Conclusion

If INSERT and UPDATE statement batching is just a matter of configuration, DELETE statements require some additional steps, which may increase the data access layer complexity.

Code available on GitHub.

How to batch INSERT and UPDATE statements with Hibernate

Introduction

JDBC has long offered support for DML statement batching. By default, all statements are sent one after the other, each one in a separate network round trip. Batching allows us to send multiple statements in one shot, saving unnecessary socket stream flushes.
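
At the JDBC level, batching boils down to grouping statements on a PreparedStatement. A minimal sketch, where the connection variable and the post table are assumptions for illustration:

try (PreparedStatement statement = connection.prepareStatement(
        "INSERT INTO post (id, title) VALUES (?, ?)")) {
    for (long i = 0; i < 3; i++) {
        statement.setLong(1, i);
        statement.setString(2, "Post no. " + i);
        //enqueue the statement instead of executing it right away
        statement.addBatch();
    }
    //send all enqueued statements in a single database round trip
    int[] updateCounts = statement.executeBatch();
}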

Hibernate hides the database statements behind a transactional write-behind abstraction layer. An intermediate layer allows us to hide the JDBC batching semantics from the persistence layer logic. This way, we can change the JDBC batching strategy without altering the data access code.

Configuring Hibernate to support JDBC batching is not as easy as it should be, so I’m going to explain everything you need to do in order to make it work.

Testing time

We’ll start with the following entity model:

[Figure: Post and Comment entity model]

The Post has a one-to-many association with the Comment entity:

@OneToMany(
    cascade = CascadeType.ALL, 
    mappedBy = "post", 
    orphanRemoval = true)
private List<Comment> comments = new ArrayList<>();

Our test scenario issues both INSERT and UPDATE statements, so we can validate that JDBC batching is being used:

LOGGER.info("Test batch insert");
long startNanos = System.nanoTime();
doInTransaction(session -> {
    int batchSize = batchSize();
    for(int i = 0; i < itemsCount(); i++) {
        Post post = new Post(
            String.format("Post no. %d", i)
        );
        int j = 0;
        post.addComment(new Comment(
                String.format(
                    "Post comment %d:%d", i, j++
        )));
        post.addComment(new Comment(
                String.format(
                     "Post comment %d:%d", i, j++
        )));
        session.persist(post);
        if(i % batchSize == 0 && i > 0) {
            session.flush();
            session.clear();
        }
    }
});
LOGGER.info("{}.testInsert took {} millis",
    getClass().getSimpleName(),
    TimeUnit.NANOSECONDS.toMillis(
        System.nanoTime() - startNanos
    ));

LOGGER.info("Test batch update");
startNanos = System.nanoTime();

doInTransaction(session -> {
    List<Post> posts = session.createQuery(
        "select distinct p " +
        "from Post p " +
        "join fetch p.comments c")
    .list();

    for(Post post : posts) {
        post.title = "Blog " + post.title;
        for(Comment comment : post.comments) {
            comment.review = "Blog " + comment.review;
        }
    }
});

LOGGER.info("{}.testUpdate took {} millis",
    getClass().getSimpleName(),
    TimeUnit.NANOSECONDS.toMillis(
        System.nanoTime() - startNanos
    ));

This test will persist a configurable number of Post entities, each one containing two Comments. For the sake of brevity, we are going to persist 3 Posts, using the Dialect default batch size:

protected int itemsCount() {
    return 3;
}

protected int batchSize() {
    return Integer.valueOf(Dialect.DEFAULT_BATCH_SIZE);
}

Default batch support

Hibernate doesn’t implicitly employ JDBC batching and each INSERT and UPDATE statement is executed separately:

Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 0,0,1]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:0,0,51]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:1,0,52]} 
Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 1,0,2]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:0,0,53]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:1,0,54]} 
Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 2,0,3]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:0,0,55]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:1,0,56]}

Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 1,1,2,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:0,1,53,0]} 
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 0,1,1,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:1,1,52,0]} 
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 2,1,3,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:0,1,55,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:1,1,56,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:0,1,51,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:1,1,54,0]} 

Configuring hibernate.jdbc.batch_size

To enable JDBC batching, we have to configure the hibernate.jdbc.batch_size property:

A non-zero value enables use of JDBC2 batch updates by Hibernate (e.g. recommended values between 5 and 30)

We’ll set this property and rerun our test:

properties.put("hibernate.jdbc.batch_size", 
    String.valueOf(batchSize()));

This time, the Comment INSERT statements are batched, while the UPDATE statements are left untouched:

Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 0,0,1]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:0,0,51]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:1,0,52]} 
Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 1,0,2]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:0,0,53]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:1,0,54]} 
Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 2,0,3]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:0,0,55]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:1,0,56]}

Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 1,1,2,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:0,1,53,0]}
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 0,1,1,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:1,1,52,0]}
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 2,1,3,0]} 
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:0,1,55,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:1,1,56,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:0,1,51,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:1,1,54,0]}

A JDBC batch can target one table only, so every new DML statement targeting a different table ends the current batch and initiates a new one. Mixing different table statements is therefore undesirable when using SQL batch processing.

Ordering statements

Hibernate can sort INSERT and UPDATE statements using the following configuration options:

properties.put("hibernate.order_inserts", "true");
properties.put("hibernate.order_updates", "true");

While the Post and Comment INSERT statements are batched accordingly, the UPDATE statements are still executed separately:

Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 0,0,1]} {[insert into Post (title, version, id) values (?, ?, ?)][Post no. 1,0,2]} {[insert into Post (title, version, id) values (?, ?, ?)][Post no. 2,0,3]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:0,0,51]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:1,0,52]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:0,0,53]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:1,0,54]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:0,0,55]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:1,0,56]}

Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:0,1,51,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:1,1,52,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:0,1,53,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:1,1,54,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:0,1,55,0]}
Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:1,1,56,0]}
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 0,1,1,0]} 
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 1,1,2,0]} 
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 2,1,3,0]} 

Adding version data batch support

To enable UPDATE batching as well, we also need to set the hibernate.jdbc.batch_versioned_data configuration property:

Set this property to true if your JDBC driver returns correct row counts from executeBatch(). It is usually safe to turn this option on. Hibernate will then use batched DML for automatically versioned data. Defaults to false.

We will rerun our test with this property set too:

properties.put("hibernate.jdbc.batch_versioned_data", "true");

Now both the INSERT and the UPDATE statements are properly batched:

Query:{[insert into Post (title, version, id) values (?, ?, ?)][Post no. 0,0,1]} {[insert into Post (title, version, id) values (?, ?, ?)][Post no. 1,0,2]} {[insert into Post (title, version, id) values (?, ?, ?)][Post no. 2,0,3]} 
Query:{[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:0,0,51]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][1,Post comment 0:1,0,52]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:0,0,53]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][2,Post comment 1:1,0,54]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:0,0,55]} {[insert into Comment (post_id, review, version, id) values (?, ?, ?, ?)][3,Post comment 2:1,0,56]}

Query:{[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:0,1,51,0]} {[update Comment set post_id=?, review=?, version=? where id=? and version=?][1,Blog Post comment 0:1,1,52,0]} {[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:0,1,53,0]} {[update Comment set post_id=?, review=?, version=? where id=? and version=?][2,Blog Post comment 1:1,1,54,0]} {[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:0,1,55,0]} {[update Comment set post_id=?, review=?, version=? where id=? and version=?][3,Blog Post comment 2:1,1,56,0]}
Query:{[update Post set title=?, version=? where id=? and version=?][Blog Post no. 0,1,1,0]} {[update Post set title=?, version=? where id=? and version=?][Blog Post no. 1,1,2,0]} {[update Post set title=?, version=? where id=? and version=?][Blog Post no. 2,1,3,0]} 

Benchmark

Now that we managed to configure Hibernate for JDBC batching, we can benchmark the performance gain of statement grouping.

  • the test case uses a PostgreSQL database installed on the same machine as the currently running JVM
  • a batch size of 50 was chosen and each test iteration increases the statement count by an order of magnitude
  • all durations are expressed in milliseconds

Statement count | No batch insert duration | No batch update duration | Batch insert duration | Batch update duration
30              | 218                      | 178                      | 191                   | 144
300             | 311                      | 327                      | 208                   | 217
3000            | 1047                     | 1089                     | 556                   | 478
30000           | 5889                     | 6032                     | 2640                  | 2301
300000          | 51785                    | 57869                    | 16052                 | 20954

The more rows we INSERT or UPDATE, the more we can benefit from JDBC batching. For write-mostly applications (e.g. enterprise batch processors), we should definitely enable JDBC batching, as the performance benefits can be staggering.

Code available on GitHub.

Hibernate CascadeType.LOCK gotchas

Introduction

Having introduced Hibernate explicit locking support, as well as Cascade Types, it’s time to analyze the CascadeType.LOCK behavior.

A Hibernate lock request triggers an internal LockEvent. The associated DefaultLockEventListener may cascade the lock request to the locking entity children.

Since CascadeType.ALL includes CascadeType.LOCK too, it’s worth understanding when a lock request propagates from a Parent to a Child entity.

Testing time

We’ll start with the following entity model:

[Figure: Post, PostDetails, and Comment entity model]

The Post is the Parent entity of both the PostDetails one-to-one association and the Comment one-to-many relationship, and these associations are marked with CascadeType.ALL:

@OneToMany(
    cascade = CascadeType.ALL, 
    mappedBy = "post", 
    orphanRemoval = true)
private List<Comment> comments = new ArrayList<>();

@OneToOne(
    cascade = CascadeType.ALL, 
    mappedBy = "post",
    fetch = FetchType.LAZY)
private PostDetails details;

All the upcoming test cases will use the following entity model graph:

doInTransaction(session -> {
    Post post = new Post();
    post.setName("Hibernate Master Class");

    post.addDetails(new PostDetails());
    post.addComment(new Comment("Good post!"));
    post.addComment(new Comment("Nice post!"));

    session.persist(post);
});

Locking managed entities

A managed entity is loaded into the currently running Persistence Context, and all its state changes are translated into DML statements.

When a managed Parent entity is being locked:

doInTransaction(session -> {
    Post post = (Post) session.createQuery(
        "select p " +
        "from Post p " +
        "join fetch p.details " +
        "where " +
        "   p.id = :id")
    .setParameter("id", 1L)
    .uniqueResult();
    session.buildLockRequest(
        new LockOptions(LockMode.PESSIMISTIC_WRITE))
    .lock(post);
});

Only the Parent entity gets locked; the cascade is therefore prevented:

select id from Post where id = 1 for update

Hibernate defines a scope LockOption, which (according to JavaDocs) should allow a lock request to be propagated to Child entities:

“scope” is a JPA defined term. It is basically a cascading of the lock to associations.

session.buildLockRequest(
    new LockOptions(LockMode.PESSIMISTIC_WRITE))
.setScope(true)
.lock(post);

Setting the scope flag doesn’t change anything, only the managed entity being locked:

select id from Post where id = 1 for update

Locking detached entities

Apart from entity locking, the lock request can reassociate detached entities too. To prove this, we are going to check the Post entity graph before and after the lock entity request:

void containsPost(Session session, 
    Post post, boolean expected) {
    assertEquals(expected, 
        session.contains(post));
    assertEquals(expected, 
        session.contains(post.getDetails()));
    for(Comment comment : post.getComments()) {
        assertEquals(expected, 
            session.contains(comment));
    }
}

The following test demonstrates how CascadeType.LOCK works for detached entities:

//Load the Post entity, which will become detached
Post post = doInTransaction(session -> 
   (Post) session.createQuery(
        "select p " +
        "from Post p " +
        "join fetch p.details " +
        "join fetch p.comments " +
        "where " +
        "   p.id = :id")
.setParameter("id", 1L)
.uniqueResult());

//Change the detached entity state
post.setName("Hibernate Training");
doInTransaction(session -> {
    //The Post entity graph is detached
    containsPost(session, post, false);

    //The Lock request associates 
    //the entity graph and locks the requested entity
    session.buildLockRequest(
        new LockOptions(LockMode.PESSIMISTIC_WRITE))
    .lock(post);
    
    //Hibernate doesn't know if the entity is dirty
    assertEquals("Hibernate Training", 
        post.getName());

    //The Post entity graph is attached
    containsPost(session, post, true);
});
doInTransaction(session -> {
    //The detached Post entity changes have been lost
    Post _post = (Post) session.get(Post.class, 1L);
    assertEquals("Hibernate Master Class", 
        _post.getName());
});

The lock request reassociates the entity graph, but the currently running Hibernate Session is unaware that the entity became dirty while in the detached state. The entity is just reattached, without forcing an UPDATE or selecting the current database state for further comparison.

Once the entity becomes managed, any further change will be detected by the dirty checking mechanism, and the flush will propagate the changes made prior to reattachment as well. If no change happens while the entity is managed, the entity will not be scheduled for flushing.

If we want to make sure the detached entity state is always synchronized with the database, we need to use merge or update.
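
A minimal sketch of the merge alternative, building on the previous example:

doInTransaction(session -> {
    //merge copies the detached state onto a managed instance,
    //so the detached name change is propagated at flush time
    Post mergedPost = (Post) session.merge(post);
    assertEquals("Hibernate Training", mergedPost.getName());
});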

Going back to lock cascading: for detached entities, the lock request is propagated to children when the scope option is set to true:

session.buildLockRequest(
    new LockOptions(LockMode.PESSIMISTIC_WRITE))
.setScope(true)
.lock(post);

The Post entity lock event is propagated to all Child entities (since we are using CascadeType.ALL):

select id from Comment where id = 1 for update
select id from Comment where id = 2 for update
select id from PostDetails where id = 1 for update
select id from Post where id = 1 for update

Conclusion

Lock cascading is far from straightforward or intuitive. Explicit locking requires diligence (the more locks we acquire, the greater the chance of deadlocking), and you are better off retaining full control over Child entity lock propagation anyway. Analogous to concurrency programming best practices, manual locking is therefore preferred over automatic lock propagation.

Code available on GitHub.
