The best way to use the JPQL DISTINCT keyword with JPA and Hibernate

Last modified:

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.

So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!

Introduction

In this article, we are going to how the JPQL DISTINCT keyword behaves depending on the underlying entity query type.

Domain Model

Let’s assume we are using the following Post and PostComment entities in our application:

The Post entity is mapped as follows:

@Entity(name = "Post")
@Table(name = "post")
public class Post {

    @Id
    @GeneratedValue
    private Long id;

    private String title;

    @Column(name = "created_on")
    private LocalDate createdOn;

    @OneToMany(
        mappedBy = "post",
        cascade = CascadeType.ALL,
        orphanRemoval = true
    )
    private List<PostComment> comments = new ArrayList<>();

    //Getters and setters omitted for brevity

    public void addComment(PostComment comment) {
        comments.add(comment);
        comment.setPost(this);
    }
}

The addComment method is useful for synchronizing both ends of the bidirectional @OneToMany association. For more details, check out this article.

And the PostComment entity looks as follows:

@Entity(name = "PostComment")
@Table(name = "post_comment")
public class PostComment {

    @Id
    @GeneratedValue
    private Long id;

    @ManyToOne(fetch = FetchType.LAZY)
    private Post post;

    private String review;

    public PostComment() {}

    public PostComment(String review) {
        this.review = review;
    }

    //Getters and setters omitted for brevity
}

By default, all @ManyToOne and @OneToOne associations are to be fetched eagerly, which is most often a terrible idea. That’s the reason we used the FetchType.LAZY strategy for the @ManyToOne association.

Test data

Now, let’s add some test data which we are going to use to demonstrate how the DISTINCT keyword works based on the underlying JPQL query type:

Post post1 = new Post();

post1.setTitle(
    "High-Performance Java Persistence eBook has been released!"
);
post1.setCreatedOn(LocalDate.of(2016, 8, 30));

entityManager.persist(post1);

post1.addComment(new PostComment("Excellent!"));
post1.addComment(new PostComment("Great!"));

Post post2 = new Post();

post2.setTitle(
    "High-Performance Java Persistence paperback has been released!"
);
post2.setCreatedOn(LocalDate.of(2016, 10, 12));

entityManager.persist(post2);

Post post3 = new Post();

post3.setTitle(
    "High-Performance Java Persistence Mach 1 video course has been released!"
);
post3.setCreatedOn(LocalDate.of(2018, 1, 30));

entityManager.persist(post3);

Post post4 = new Post();

post4.setTitle(
    "High-Performance Java Persistence Mach 2 video course has been released!"
);
post4.setCreatedOn(LocalDate.of(2018, 5, 8));

entityManager.persist(post4);

DISTINCT with JPQL scalar queries

When using a scalar projection like the one in the following example:

List<Integer> publicationYears = entityManager.createQuery("""
    select distinct year(p.createdOn)
    from Post p
    order by year(p.createdOn)
	""", Integer.class)
.getResultList();

LOGGER.info("Publication years: {}", publicationYears);

The DISTINCT keyword is needed to be passed to the underlying SQL statement, and Hibernate is going to output the following result:

SELECT DISTINCT 
    extract(YEAR FROM p.created_on) AS col_0_0_
FROM 
    post p
ORDER BY 
    extract(YEAR FROM p.created_on)

-- Publication years: [2016, 2018]

Therefore, for scalar queries, the DISTINCT JPQL keyword is needed to be passed to the underlying SQL query since we want the result set to remove duplicates.

Hibernate 5 and using DISTINCT with JPQL entity queries

When using Hibernate 5, entity queries that JOIN FETCH one-to-many or many-to-many collections can duplicate the parent entity Object references.

To visualize this behavior, consider the following JPQL query:

List<Post> posts = entityManager.createQuery("""
    select p
    from Post p
    left join fetch p.comments
    where p.title = :title
    """, Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}", 
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

When running the JPQL query above with Hibernate 5, we get the following output:

SELECT p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN 
       post_comment pc ON p.id=pc.post_id
WHERE 
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1, 1]

As illustrated by the log message, the returned posts List contains two references of the same Post entity object. This is because the JOIN duplicates the parent record for every child row that’s going to be fetched.

To remove the entity reference duplicates, we need to use the DISTINCT JPQL keyword:

List<Post> posts = entityManager.createQuery("""
    select distinct p
    from Post p
    left join fetch p.comments
    where p.title = :title
    """, Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}", 
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

When executing the JPQL query above, Hibernate will now generate the following output:

SELECT DISTINCT
       p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN 
       post_comment pc ON p.id=pc.post_id
WHERE 
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1]

For this SQL query, the DISTINCT keyword serves no purpose since the result set will contain unique parent-child records. The duplication didn’t come from the SQL query result set but from the parent Java Object references.

So, by using DISTINCT with entity queries, the duplicates are removed from the List in Java after the query is executed and prior to returning the List back to the caller.

However, the JPQL DISTINCT keyword was passed to the underlying SQL statement, and if we analyze the execution plan for the previous SQL statement, we can see that a quicksort execution is being added to the plan:

Unique  (cost=23.71..23.72 rows=1 width=1068) (actual time=0.131..0.132 rows=2 loops=1)
  ->  Sort  (cost=23.71..23.71 rows=1 width=1068) (actual time=0.131..0.131 rows=2 loops=1)
        Sort Key: p.id, pc.id, p.created_on, pc.post_id, pc.review
        Sort Method: quicksort  Memory: 25kB
        ->  Hash Right Join  (cost=11.76..23.70 rows=1 width=1068) (actual time=0.054..0.058 rows=2 loops=1)
              Hash Cond: (pc.post_id = p.id)
              ->  Seq Scan on post_comment pc  (cost=0.00..11.40 rows=140 width=532) (actual time=0.010..0.010 rows=2 loops=1)
              ->  Hash  (cost=11.75..11.75 rows=1 width=528) (actual time=0.027..0.027 rows=1 loops=1)
                    Buckets: 1024  Batches: 1  Memory Usage: 9kB
                    ->  Seq Scan on post p  (cost=0.00..11.75 rows=1 width=528) (actual time=0.017..0.018 rows=1 loops=1)
                          Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text)
                          Rows Removed by Filter: 3

Planning time: 0.227 ms
Execution time: 0.179 ms

The quicksort execution adds an unneeded overhead to our statement execution since we don’t need to eliminate any duplicates since the result set contains unique parent-child row combinations.

Using the hibernate.query.passDistinctThrough JPQL query hint with Hibernate 5

To avoid passing the DISTINCT keyword to the underlying SQL statement, we need to activate the hibernate.query.passDistinctThrough JPQL query hint, as illustrated by the following example:

List<Post> posts = entityManager.createQuery("""
    select distinct p
    from Post p
    left join fetch p.comments
    where p.title = :title
    """, Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.setHint("hibernate.query.passDistinctThrough", false)
.getResultList();

LOGGER.info(
    "Fetched the following Post entity identifiers: {}", 
    posts.stream().map(Post::getId).collect(Collectors.toList())
);

When running the JPQL with the hibernate.query.passDistinctThrough hint activated, Hibernate 5 executes the following SQL query:

SELECT
       p.id AS id1_0_0_,
       pc.id AS id1_1_1_,
       p.created_on AS created_2_0_0_,
       p.title AS title3_0_0_,
       pc.post_id AS post_id3_1_1_,
       pc.review AS review2_1_1_,
       pc.post_id AS post_id3_1_0__
FROM   post p
LEFT OUTER JOIN 
       post_comment pc ON p.id=pc.post_id
WHERE 
       p.title='High-Performance Java Persistence eBook has been released!'

-- Fetched the following Post entity identifiers: [1]

Therefore, the DISTINCT keyword is no longer passed to the SQL query, but entity duplicates are removed from the returning posts List.

If we analyze the execution plan for the last SQL query, we can see that the quicksort execution is no longer added to the execution plan:

Hash Right Join  (cost=11.76..23.70 rows=1 width=1068) (actual time=0.066..0.069 rows=2 loops=1)
  Hash Cond: (pc.post_id = p.id)
  ->  Seq Scan on post_comment pc  (cost=0.00..11.40 rows=140 width=532) (actual time=0.011..0.011 rows=2 loops=1)
  ->  Hash  (cost=11.75..11.75 rows=1 width=528) (actual time=0.041..0.041 rows=1 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Seq Scan on post p  (cost=0.00..11.75 rows=1 width=528) (actual time=0.036..0.037 rows=1 loops=1)
              Filter: ((title)::text = 'High-Performance Java Persistence eBook has been released!'::text)
              Rows Removed by Filter: 3

Planning time: 1.184 ms
Execution time: 0.160 ms

Keep in mind that the hibernate.query.passDistinctThrough is only needed in Hibernate 5.

In Hibernate 6, this hint was removed since you no longer need to use DISTINCT to duplicate Java object references.

Hibernate 6 auto-deduplication

Hibernate 6 can deduplicate parent entity references automatically, so you don’t need to use the DISTINCT keyword, as it were the case with Hibernate 5.

Therefore, when running the following query:

List<Post> posts = entityManager.createQuery("""
    select p
    from Post p
    left join fetch p.comments
    where p.title = :title
    """, Post.class)
.setParameter(
    "title", 
    "High-Performance Java Persistence eBook has been released!"
)
.getResultList();

assertEquals(1, posts.size());
assertEquals(2, posts.get(0).getComments().size());

We can see that a single Post entity was fetched even if it has two associated PostComment child entities.

I'm running an online workshop on the 20-21 and 23-24 of November about High-Performance Java Persistence.

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

Because the DISTINCT JPQL keyword has two meanings based on the underlying query type, it’s important to pass it through to the SQL statement only for scalar queries where the result set requires duplicates to be removed by the database engine.

For parent-child entity queries where the child collection is using JOIN FETCH, the DISTINCT keyword should only be applied after the ResultSet is got from JDBC, therefore avoiding passing DISTINCT to the SQL statement that gets executed.