Hash Join Algorithm

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Hypersistence Optimizer is that tool!

Introduction

In this article, we are going to see how the Hash Join Algorithm works and when it’s suitable for a relational database system to employ it in order to execute an SQL JOIN query.

Data Sets

Let’s consider we have two relations, a parent Post and a child PostComment, that look as follows:

Hash Join Relations

Because the postId attribute in the PostComment relation references the id attribute in the parent Post relation, the two entities form a one-to-many relationship.

The parent Post relation has 1000 records that look as follows:

| id   | title         |
|------|---------------|
| 1    | Post no. 1    |
| 2    | Post no. 2    |
| ..   | ..            |
| 999  | Post no. 999  |
| 1000 | Post no. 1000 |

And, the child PostComment relation has 10000 rows that are associated to the 1000 Post records:

| id    | review            | postId  |
|-------|-------------------|---------|
| 1     | Comment no. 1     | 1       |
| 2     | Comment no. 2     | 1       |
| ..    | ..                | ..      |
| 9999  | Comment no. 9999  | 1000    |
| 10000 | Comment no. 10000 | 1000    |

We are interested in joining the Post and PostComment records by matching the id attribute of the Post relation with the postId attribute of the PostComment relation so that we can build a projection that contains the following attributes:

  • the Post identifier
  • the Post title
  • the PostComment review

In our case, this is how the aforementioned report should look like:

| post_id | post_title    | review            |
|---------|---------------|-------------------|
| 1       | Post no. 1    | Comment no. 1     |
| 1       | Post no. 1    | Comment no. 2     |
| 1       | Post no. 1    | Comment no. 3     |
| 1       | Post no. 1    | Comment no. 4     |
| 1       | Post no. 1    | Comment no. 5     |
| 1       | Post no. 1    | Comment no. 6     |
| 1       | Post no. 1    | Comment no. 7     |
| 1       | Post no. 1    | Comment no. 8     |
| 1       | Post no. 1    | Comment no. 9     |
| ..      |..             | ..                |
| 1000    | Post no. 1000 | Comment no. 9999  |
| 1000    | Post no. 1000 | Comment no. 10000 |

Hash Join Algorithm

The Hash Join Algorithm consists of two steps. In the first step, it creates an in-memory hash table structure from the records of the relation with fewer elements.

Map<Long, Post> postMap = new HashMap<>();

for (Post post : posts) {
    postMap.put(post.getId(), post);
}

As you can see in the above code snippet, the attribute used by the join condition becomes the key and the record itself becomes the value of the in-memory hash map.

In the second step, the larger relation is iterated and the smaller table record is located using the previously build hash map:

List<Tuple> tuples = new ArrayList<>();

for (PostComment postComment : postComments) {
    Long postId = postComment.getPostId();
    Post post = postMap.get(postId);

    if (post != null) {
        tuples.add(
            new Tuple()
                .add("post_id", postComment.getPostId())
                .add("post_title", post.getTitle())
                .add("review", postComment.getReview())
        );
    }
}

Unlike the Nested Loops algorithm, the complexity of the Hash Join algorithm is linear (e.g., O(N + M)), and the larger the size of the relations, the more processing will be needed to find all matching records, as illustrated by the following graph:

Hash Join Algorithm Complexity

The Hash Join algorithm may be used by relational database systems when the joining relations don’t need to be sorted prior to being scanned and if there is enough memory to hold the in-memory structure that’s built in the first step.

For instance, running this SQL query on PostgreSQL when joining the a post table with 1000 records and a post_comment table with 10,000 rows:

SELECT
   p.id AS post_id,
   p.title AS post_title,
   pc.review  AS review
FROM post p
INNER JOIN post_comment pc ON pc.post_id = p.id

produces a Hash Join, as illustrated by the underlying execution plan:

Hash Join  
  (cost=29.50..238.86 rows=10000 width=1040) 
  (actual time=0.821..10.278 rows=10000 loops=1)
  Hash Cond: (pc.post_id = p.id)
  ->  Seq Scan on post_comment pc  
        (cost=0.00..183.00 rows=10000 width=524) 
        (actual time=0.155..2.833 rows=10000 loops=1)
  ->  Hash  
          (cost=17.00..17.00 rows=1000 width=524) 
          (actual time=0.534..0.535 rows=1000 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 60kB
        ->  Seq Scan on post p  
              (cost=0.00..17.00 rows=1000 width=524) 
              (actual time=0.036..0.272 rows=1000 loops=1)

I'm running an online workshop on the 27th of May about High-Performance SQL.

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

The Hash Join algorithm is a very common strategy used by relational database systems when joining larger tables because the cost of using the Nested Loops algorithm would be much higher.

Traditionally, MySQL has been offering only the Nested Loops algorithm would be much higher, but since version 8.0.18, it supports the Hash Join algorithm as well.

On the other hand, Oracle, PostgreSQL, and SQL Server have been supporting the Hash Join algorithm for a very long time.

Transactions and Concurrency Control eBook

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.