Merge Join Algorithm

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Hypersistence Optimizer is that tool!

Introduction

In this article, we are going to see how the Merge Join Algorithm, also known as Sort-Merge Join, works and when it’s suitable for a relational database system to employ it in order to execute an SQL JOIN query.

Data Sets

Let’s consider we have a parent Post and a child PostComment relations, looking as follows:

Merge Join Relations

The two entities form a one-to-many relationship because the postId attribute in the PostComment relation references the id attribute in the parent Post relation.

The Post entity has an associated post table with 1000 records that look as follows:

| id   | title         |
|------|---------------|
| 1    | Post no. 1    |
| 2    | Post no. 2    |
| ..   | ..            |
| 999  | Post no. 999  |
| 1000 | Post no. 1000 |

And, the child PostComment entity has 10,000 rows that are associated to the 1000 post records via the postId attribute:

| id    | review            | postId  |
|-------|-------------------|---------|
| 1     | Comment no. 1     | 1       |
| 2     | Comment no. 2     | 1       |
| ..    | ..                | ..      |
| 9999  | Comment no. 9999  | 1000    |
| 10000 | Comment no. 10000 | 1000    |

We are interested in joining the Post and PostComment records by matching the id attribute of the Post relation with the postId attribute of the PostComment relation so that we can build a projection that contains the following attributes:

  • the Post identifier
  • the Post title
  • the PostComment review

In our case, this is how the aforementioned report should look like:

| post_id | post_title    | review            |
|---------|---------------|-------------------|
| 1000    | Post no. 1000 | Comment no. 10000 |
| 1000    | Post no. 1000 | Comment no. 9999  |
| 1000    | Post no. 1000 | Comment no. 9998  |
| 1000    | Post no. 1000 | Comment no. 9997  |
| 1000    | Post no. 1000 | Comment no. 9996  |
| 1000    | Post no. 1000 | Comment no. 9995  |
| 1000    | Post no. 1000 | Comment no. 9994  |
| 1000    | Post no. 1000 | Comment no. 9993  |
| 1000    | Post no. 1000 | Comment no. 9992  |
| 1000    | Post no. 1000 | Comment no. 9991  |
| ..      |..             | ..                |
| 1       | Post no. 1    | Comment no. 2     |
| 1       | Post no. 1    | Comment no. 1     |

Merge Join Algorithm

The Merge Join Algorithm consists of two steps. In the first step, it needs to sort the two tables by the join attribute.

posts.sort(Comparator.comparing(Post::getId));

postComments.sort((pc1, pc2) -> {
    int result = Comparator
        .comparing(PostComment::getPostId)
        .compare(pc1, pc2);
    
    return result != 0 ? result : Comparator
        .comparing(PostComment::getId)
        .compare(pc1, pc2);
});

In the second step, we iterate the two tables and check the join condition.

List<Tuple> tuples = new ArrayList<>();

int postCount = posts.size(), postCommentCount = postComments.size();
int i = 0, j = 0;

while(i < postCount && j < postCommentCount) {
    Post post = posts.get(i);
    PostComment postComment = postComments.get(j);
    
    if(post.getId().equals(postComment.getPostId())) {
        tuples.add(
            new Tuple()
                .add("post_id", postComment.getPostId())
                .add("post_title", post.getTitle())
                .add("review", postComment.getReview())
        );
        j++;
    } else {
        i++;
    }
}

Unlike the Nested Loops or the Hash Join algorithms, the complexity of the Merge Join algorithm is log-star n (e.g., O(nlog(n) + mlog(m))), as illustrated by the following graph:

Merge Join Algorithm Complexity

The Merge Join algorithm may be used by relational database systems when the joining relations have an index, hence there is no need to sort the relation as the index can be used to read the records in the desired sorted order.

For instance, running this SQL query on PostgreSQL when joining the a post table with 1000 records and a post_comment table with 10,000 rows:

SELECT
   p.id AS post_id,
   p.title AS post_title,
   pc.review  AS review
FROM post p
INNER JOIN post_comment pc ON pc.post_id = p.id
ORDER BY pc.post_id DESC

produces a Merge Join, as illustrated by the underlying execution plan:

Merge Join  
  (cost=0.56..793.06 rows=10000 width=1048) 
  (actual time=0.621..8.986 rows=10000 loops=1)
  Merge Cond: (p.id = pc.post_id)
  ->  Index Scan Backward using idx_post_id on post p  
        (cost=0.28..63.27 rows=1000 width=524) 
        (actual time=0.402..0.798 rows=1000 loops=1)
  ->  Index Scan Backward using idx_post_comment_post_id on post_comment pc  
        (cost=0.29..602.28 rows=10000 width=524) 
        (actual time=0.167..4.583 rows=10000 loops=1)

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

The Merge Join algorithm is used by relational database systems when joining larger tables in the order provided by the joining columns, as using the Nested Loops algorithm would have a much higher cost, and using the Hash Join algorithm would require an extra sorting step.

While Oracle, SQL Server, and PostgreSQL support the Merge Join algorithm, MySQL doesn’t support it yet.

Transactions and Concurrency Control eBook

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.