Hash Join Algorithm
Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?
Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.
So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!
Introduction
In this article, we are going to see how the Hash Join Algorithm works and when it’s suitable for a relational database system to employ it in order to execute an SQL JOIN query.
Data Sets
Let’s consider we have two relations, a parent Post
and a child PostComment
, that look as follows:
Because the postId
attribute in the PostComment
relation references the id
attribute in the parent Post
relation, the two entities form a one-to-many relationship.
The parent Post
relation has 1000 records that look as follows:
| id | title | |------|---------------| | 1 | Post no. 1 | | 2 | Post no. 2 | | .. | .. | | 999 | Post no. 999 | | 1000 | Post no. 1000 |
And, the child PostComment
relation has 10000 rows that are associated to the 1000 Post
records:
| id | review | postId | |-------|-------------------|---------| | 1 | Comment no. 1 | 1 | | 2 | Comment no. 2 | 1 | | .. | .. | .. | | 9999 | Comment no. 9999 | 1000 | | 10000 | Comment no. 10000 | 1000 |
We are interested in joining the Post
and PostComment
records by matching the id
attribute of the Post
relation with the postId
attribute of the PostComment
relation so that we can build a projection that contains the following attributes:
- the
Post
identifier - the
Post
title - the
PostComment
review
In our case, this is how the aforementioned report should look like:
| post_id | post_title | review | |---------|---------------|-------------------| | 1 | Post no. 1 | Comment no. 1 | | 1 | Post no. 1 | Comment no. 2 | | 1 | Post no. 1 | Comment no. 3 | | 1 | Post no. 1 | Comment no. 4 | | 1 | Post no. 1 | Comment no. 5 | | 1 | Post no. 1 | Comment no. 6 | | 1 | Post no. 1 | Comment no. 7 | | 1 | Post no. 1 | Comment no. 8 | | 1 | Post no. 1 | Comment no. 9 | | .. |.. | .. | | 1000 | Post no. 1000 | Comment no. 9999 | | 1000 | Post no. 1000 | Comment no. 10000 |
Hash Join Algorithm
The Hash Join Algorithm consists of two steps. In the first step, it creates an in-memory hash table structure from the records of the relation with fewer elements.
Map<Long, Post> postMap = new HashMap<>(); for (Post post : posts) { postMap.put(post.getId(), post); }
As you can see in the above code snippet, the attribute used by the join condition becomes the key and the record itself becomes the value of the in-memory hash map.
In the second step, the larger relation is iterated and the smaller table record is located using the previously build hash map:
List<Tuple> tuples = new ArrayList<>(); for (PostComment postComment : postComments) { Long postId = postComment.getPostId(); Post post = postMap.get(postId); if (post != null) { tuples.add( new Tuple() .add("post_id", postComment.getPostId()) .add("post_title", post.getTitle()) .add("review", postComment.getReview()) ); } }
Unlike the Nested Loops algorithm, the complexity of the Hash Join algorithm is linear (e.g., O(N + M)
), and the larger the size of the relations, the more processing will be needed to find all matching records, as illustrated by the following graph:
The Hash Join algorithm may be used by relational database systems when joining relations using an EquiJoin predicate if one database relation is rather large and there is enough memory to hold the in-memory HashTable structure that’s needed to be built in the first step.
For instance, running this SQL query on PostgreSQL when joining the a post
table with 1000 records and a post_comment
table with 10,000 rows:
SELECT p.id AS post_id, p.title AS post_title, pc.review AS review FROM post p INNER JOIN post_comment pc ON pc.post_id = p.id
produces a Hash Join, as illustrated by the underlying execution plan:
Hash Join (cost=29.50..238.86 rows=10000 width=1040) (actual time=0.821..10.278 rows=10000 loops=1) Hash Cond: (pc.post_id = p.id) -> Seq Scan on post_comment pc (cost=0.00..183.00 rows=10000 width=524) (actual time=0.155..2.833 rows=10000 loops=1) -> Hash (cost=17.00..17.00 rows=1000 width=524) (actual time=0.534..0.535 rows=1000 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 60kB -> Seq Scan on post p (cost=0.00..17.00 rows=1000 width=524) (actual time=0.036..0.272 rows=1000 loops=1)
If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.
And there is more!
You can earn a significant passive income stream from promoting all these amazing products that I have been creating.
If you're interested in supplementing your income, then join my affiliate program.
Conclusion
The Hash Join algorithm is a very common strategy used by relational database systems when joining larger tables because of the cost of using the Nested Loops algorithm would be much higher.
Traditionally, MySQL has been offering only the Nested Loops algorithm would be much higher, but since version 8.0.18, it supports the Hash Join algorithm as well.
On the other hand, Oracle, PostgreSQL, and SQL Server have been supporting the Hash Join algorithm for a very long time.
