Merge Join Algorithm

Last modified:

Are you struggling with performance issues in your Spring, Jakarta EE, or Java EE application?

What if there were a tool that could automatically detect what caused performance issues in your JPA and Hibernate data access layer?

Wouldn’t it be awesome to have such a tool to watch your application and prevent performance issues during development, long before they affect production systems?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, Micronaut, or Play Framework.

So, rather than fixing performance issues in your production system on a Saturday night, you are better off using Hypersistence Optimizer to help you prevent those issues so that you can spend your time on the things that you love!

Introduction

In this article, we are going to see how the Merge Join Algorithm, also known as Sort-Merge Join, works and when it’s suitable for a relational database system to employ it in order to execute an SQL JOIN query.

Data Sets

Let’s consider we have a parent Post and a child PostComment relations, looking as follows:

Merge Join Relations

The two entities form a one-to-many relationship because the postId attribute in the PostComment relation references the id attribute in the parent Post relation.

The Post entity has an associated post table with 1000 records that look as follows:

| id   | title         |
|------|---------------|
| 1    | Post no. 1    |
| 2    | Post no. 2    |
| ..   | ..            |
| 999  | Post no. 999  |
| 1000 | Post no. 1000 |

And, the child PostComment entity has 10,000 rows that are associated to the 1000 post records via the postId attribute:

| id    | review            | postId  |
|-------|-------------------|---------|
| 1     | Comment no. 1     | 1       |
| 2     | Comment no. 2     | 1       |
| ..    | ..                | ..      |
| 9999  | Comment no. 9999  | 1000    |
| 10000 | Comment no. 10000 | 1000    |

We are interested in joining the Post and PostComment records by matching the id attribute of the Post relation with the postId attribute of the PostComment relation so that we can build a projection that contains the following attributes:

the Post identifier
the Post title
the PostComment review

In our case, this is how the aforementioned report should look like:

| post_id | post_title    | review            |
|---------|---------------|-------------------|
| 1000    | Post no. 1000 | Comment no. 10000 |
| 1000    | Post no. 1000 | Comment no. 9999  |
| 1000    | Post no. 1000 | Comment no. 9998  |
| 1000    | Post no. 1000 | Comment no. 9997  |
| 1000    | Post no. 1000 | Comment no. 9996  |
| 1000    | Post no. 1000 | Comment no. 9995  |
| 1000    | Post no. 1000 | Comment no. 9994  |
| 1000    | Post no. 1000 | Comment no. 9993  |
| 1000    | Post no. 1000 | Comment no. 9992  |
| 1000    | Post no. 1000 | Comment no. 9991  |
| ..      |..             | ..                |
| 1       | Post no. 1    | Comment no. 2     |
| 1       | Post no. 1    | Comment no. 1     |

Merge Join Algorithm

The Merge Join Algorithm consists of two steps. In the first step, it needs to sort the two tables by the join attribute.

posts.sort(Comparator.comparing(Post::getId));

postComments.sort((pc1, pc2) -> {
    int result = Comparator
        .comparing(PostComment::getPostId)
        .compare(pc1, pc2);
    
    return result != 0 ? result : Comparator
        .comparing(PostComment::getId)
        .compare(pc1, pc2);
});

In the second step, we iterate the two tables and check the join condition.

List<Tuple> tuples = new ArrayList<>();

int postCount = posts.size(), postCommentCount = postComments.size();
int i = 0, j = 0;

while(i < postCount && j < postCommentCount) {
    Post post = posts.get(i);
    PostComment postComment = postComments.get(j);
    
    if(post.getId().equals(postComment.getPostId())) {
        tuples.add(
            new Tuple()
                .add("post_id", postComment.getPostId())
                .add("post_title", post.getTitle())
                .add("review", postComment.getReview())
        );
        j++;
    } else {
        i++;
    }
}

Unlike the Nested Loops or the Hash Join algorithms, the complexity of the Merge Join algorithm is log-star n (e.g., O(nlog(n) + mlog(m))), as illustrated by the following graph:

Merge Join Algorithm Complexity

The Merge Join algorithm may be used by relational database systems when the joining relations have an index, hence there is no need to sort the relation as the index can be used to read the records in the desired sorted order.

For instance, running this SQL query on PostgreSQL when joining the a post table with 1000 records and a post_comment table with 10,000 rows:

SELECT
   p.id AS post_id,
   p.title AS post_title,
   pc.review  AS review
FROM post p
INNER JOIN post_comment pc ON pc.post_id = p.id
ORDER BY pc.post_id DESC

produces a Merge Join, as illustrated by the underlying execution plan:

Merge Join  
  (cost=0.56..793.06 rows=10000 width=1048) 
  (actual time=0.621..8.986 rows=10000 loops=1)
  Merge Cond: (p.id = pc.post_id)
  ->  Index Scan Backward using idx_post_id on post p  
        (cost=0.28..63.27 rows=1000 width=524) 
        (actual time=0.402..0.798 rows=1000 loops=1)
  ->  Index Scan Backward using idx_post_comment_post_id on post_comment pc  
        (cost=0.29..602.28 rows=10000 width=524) 
        (actual time=0.167..4.583 rows=10000 loops=1)

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

The Merge Join algorithm is used by relational database systems when joining larger tables in the order provided by the joining columns, as using the Nested Loops algorithm would have a much higher cost, and using the Hash Join algorithm might require an extra sorting step at the end if the records produced by the Hash Join do not follow the same sorting criteria required by the ORDER BY clause.

While Oracle, SQL Server, and PostgreSQL support the Merge Join algorithm, MySQL doesn’t support it yet.

Follow @vlad_mihalcea