Merge Join Algorithm
Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?
Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.
So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!
Introduction
In this article, we are going to see how the Merge Join Algorithm, also known as Sort-Merge Join, works and when it’s suitable for a relational database system to employ it in order to execute an SQL JOIN query.
Data Sets
Let’s consider we have a parent Post
and a child PostComment
relations, looking as follows:
The two entities form a one-to-many relationship because the postId
attribute in the PostComment
relation references the id
attribute in the parent Post
relation.
The Post
entity has an associated post
table with 1000 records that look as follows:
| id | title | |------|---------------| | 1 | Post no. 1 | | 2 | Post no. 2 | | .. | .. | | 999 | Post no. 999 | | 1000 | Post no. 1000 |
And, the child PostComment
entity has 10,000 rows that are associated to the 1000 post
records via the postId
attribute:
| id | review | postId | |-------|-------------------|---------| | 1 | Comment no. 1 | 1 | | 2 | Comment no. 2 | 1 | | .. | .. | .. | | 9999 | Comment no. 9999 | 1000 | | 10000 | Comment no. 10000 | 1000 |
We are interested in joining the Post
and PostComment
records by matching the id
attribute of the Post
relation with the postId
attribute of the PostComment
relation so that we can build a projection that contains the following attributes:
- the
Post
identifier - the
Post
title - the
PostComment
review
In our case, this is how the aforementioned report should look like:
| post_id | post_title | review | |---------|---------------|-------------------| | 1000 | Post no. 1000 | Comment no. 10000 | | 1000 | Post no. 1000 | Comment no. 9999 | | 1000 | Post no. 1000 | Comment no. 9998 | | 1000 | Post no. 1000 | Comment no. 9997 | | 1000 | Post no. 1000 | Comment no. 9996 | | 1000 | Post no. 1000 | Comment no. 9995 | | 1000 | Post no. 1000 | Comment no. 9994 | | 1000 | Post no. 1000 | Comment no. 9993 | | 1000 | Post no. 1000 | Comment no. 9992 | | 1000 | Post no. 1000 | Comment no. 9991 | | .. |.. | .. | | 1 | Post no. 1 | Comment no. 2 | | 1 | Post no. 1 | Comment no. 1 |
Merge Join Algorithm
The Merge Join Algorithm consists of two steps. In the first step, it needs to sort the two tables by the join attribute.
posts.sort(Comparator.comparing(Post::getId)); postComments.sort((pc1, pc2) -> { int result = Comparator .comparing(PostComment::getPostId) .compare(pc1, pc2); return result != 0 ? result : Comparator .comparing(PostComment::getId) .compare(pc1, pc2); });
In the second step, we iterate the two tables and check the join condition.
List<Tuple> tuples = new ArrayList<>(); int postCount = posts.size(), postCommentCount = postComments.size(); int i = 0, j = 0; while(i < postCount && j < postCommentCount) { Post post = posts.get(i); PostComment postComment = postComments.get(j); if(post.getId().equals(postComment.getPostId())) { tuples.add( new Tuple() .add("post_id", postComment.getPostId()) .add("post_title", post.getTitle()) .add("review", postComment.getReview()) ); j++; } else { i++; } }
Unlike the Nested Loops or the Hash Join algorithms, the complexity of the Merge Join algorithm is log-star n (e.g., O(nlog(n) + mlog(m))
), as illustrated by the following graph:
The Merge Join algorithm may be used by relational database systems when the joining relations have an index, hence there is no need to sort the relation as the index can be used to read the records in the desired sorted order.
For instance, running this SQL query on PostgreSQL when joining the a post
table with 1000 records and a post_comment
table with 10,000 rows:
SELECT p.id AS post_id, p.title AS post_title, pc.review AS review FROM post p INNER JOIN post_comment pc ON pc.post_id = p.id ORDER BY pc.post_id DESC
produces a Merge Join, as illustrated by the underlying execution plan:
Merge Join (cost=0.56..793.06 rows=10000 width=1048) (actual time=0.621..8.986 rows=10000 loops=1) Merge Cond: (p.id = pc.post_id) -> Index Scan Backward using idx_post_id on post p (cost=0.28..63.27 rows=1000 width=524) (actual time=0.402..0.798 rows=1000 loops=1) -> Index Scan Backward using idx_post_comment_post_id on post_comment pc (cost=0.29..602.28 rows=10000 width=524) (actual time=0.167..4.583 rows=10000 loops=1)
If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.
And there is more!
You can earn a significant passive income stream from promoting all these amazing products that I have been creating.
If you're interested in supplementing your income, then join my affiliate program.
Conclusion
The Merge Join algorithm is used by relational database systems when joining larger tables in the order provided by the joining columns, as using the Nested Loops algorithm would have a much higher cost, and using the Hash Join algorithm might require an extra sorting step at the end if the records produced by the Hash Join do not follow the same sorting criteria required by the ORDER BY clause.
While Oracle, SQL Server, and PostgreSQL support the Merge Join algorithm, MySQL doesn’t support it yet.
