Why you should definitely learn SQL Window Functions

Introduction

I found this question on the Hibernate forum, and it’s a very good opportunity to show why mastering Windows Functions is a very important skill for any backend software developer.

Domain Model

Let’s assume we have the following entries table in our database:

id c1 c2 c3 c4 c5
1 2000 a 1 x 0
2 2000 a 1 y 0
3 2000 a 1 z 0
4 2000 a 2 z 0
5 2000 a 2 x 0
6 2000 b 1 x 0
7 2000 b 1 y 0
8 2000 b 1 z 0
9 2000 b 2 z 0
10 2001 a 1 x 0
11 2001 a 1 y 0
12 2001 a 1 z 0
13 2001 a 2 z 0
14 2001 a 2 x 0
15 2001 a 2 y 0
16 2001 a 2 w 0
17 2001 a 3 y 0
18 2001 a 3 w 0
19 2001 b 1 x 0
20 2001 b 1 y 0
21 2001 b 2 x 0
22 2001 b 2 z 0

The problem

As stated in the question, the user’s goal is to:

I want to update column c5 (to 1) of each group base on column c1, c2, c3 where c3 is maximum in same c1, c2 group.

Easy peasy!

As I already explained, SQL is a Magic Wand. Not only that SQL has been a driving force in RDBMS widespread adoption, but even NewSQL databases (Google Spanner, CockroachDB) or Data streaming frameworks, like Kafka, have adopted SQL.

Window Functions to the rescue!

So, this is how you can solve this problem:

int updateCount = entityManager.createNativeQuery(
    "update entries set c5 = 1 " +
    "where id in " +
    "( " +
    "    select id " +
    "    from ( " +
    "        select *, MAX (c3) OVER (PARTITION BY c1, c2) as max_c3 " +
    "        from entries " +
    "    ) t " +
    "    where t.c3 = t.max_c3 " +
    ") ")
.executeUpdate();

assertEquals( 7, updateCount );

Because Window Functions allow you to aggregate values without breaking the returning table result set, we can easily find the matching identifiers which can be passed to the UPDATE statement.

The Execution Plan for the UPDATE statement above looks as follows:

explain analyze 
update entries set c5 = 1
where id in
(
    select id
    from (
        select *, MAX (c3) OVER (PARTITION BY c1, c2) as max_c3
        from entries
    ) t
    where t.c3 = t.max_c3
)

Update on entries  (cost=15.27..23.30 rows=1 width=2134) (actual time=0.154..0.154 rows=0 loops=1)
  ->  Nested Loop  (cost=15.27..23.30 rows=1 width=2134) (actual time=0.094..0.104 rows=7 loops=1)
  ->  HashAggregate  (cost=15.12..15.13 rows=1 width=1084) (actual time=0.083..0.085 rows=7 loops=1)
    Group Key: t.id
    ->  Subquery Scan on t  (cost=12.85..15.12 rows=1 width=1084) (actual time=0.063..0.080 rows=7 loops=1)
      Filter: (t.c3 = t.max_c3)
      Rows Removed by Filter: 15
      ->  WindowAgg  (cost=12.85..14.25 rows=70 width=1056) (actual time=0.053..0.065 rows=22 loops=1)
        ->  Sort  (cost=12.85..13.02 rows=70 width=1052) (actual time=0.044..0.045 rows=22 loops=1)
        Sort Key: entries_1.c1, entries_1.c2
        Sort Method: quicksort  Memory: 26kB
        ->  Seq Scan on entries entries_1  (cost=0.00..10.70 rows=70 width=1052) (actual time=0.009..0.011 rows=22 loops=1)
  ->  Index Scan using entries_pkey on entries  (cost=0.14..8.16 rows=1 width=1054) (actual time=0.002..0.002 rows=1 loops=7)
    Index Cond: (id = t.id)
Planning time: 0.201 ms
Execution time: 0.230 ms

What if you can’t use Window Functions?

Nowadays, all major DBs support Window Functions, MySQL 8.0 being one of the last major RDBMS to join the club. Oracle, PostgreSQL and SQL Server have been supporting Window Functions for quite some time now.

However, assuming that you are stuck with some old MySQL instance, you can still solve this problem using the following SQL query:

int updateCount = entityManager.createNativeQuery(
    "update entries set c5 = 1 " +
    "where id in " +
    "( " +
    "    select e.id " +
    "    from entries e  " +
    "    inner join ( " +
    "        select c1, c2, max(c3) as max_c3 " +
    "        from entries " +
    "        group by c1, c2 " +
    "    ) t " +
    "    on e.c1 = t.c1 and e.c2 = t.c2 and e.c3 = t.max_c3  " +
    ") " )
.executeUpdate();

assertEquals( 7, updateCount );

The Execution Plan for the UPDATE statement above looks as follows:

explain analyze 
update entries set c5 = 1
where id in
(
    select e.id
	from entries e 
	inner join (
		select c1, c2, max(c3) as max_c3
		from entries
		group by c1, c2
	) t
	on e.c1 = t.c1 and e.c2 = t.c2 and e.c3 = t.max_c3 
)

Update on entries  (cost=25.49..26.22 rows=1 width=1612) (actual time=0.112..0.112 rows=0 loops=1)
  ->  Nested Loop  (cost=25.49..26.22 rows=1 width=1612) (actual time=0.081..0.090 rows=7 loops=1)
  ->  HashAggregate  (cost=25.35..25.36 rows=1 width=562) (actual time=0.074..0.075 rows=7 loops=1)
    Group Key: e.id
    ->  Hash Join  (cost=13.85..25.35 rows=1 width=562) (actual time=0.067..0.070 rows=7 loops=1)
      Hash Cond: ((e.c1 = t.c1) AND ((e.c2)::text = (t.c2)::text) AND (e.c3 = t.max_c3))
      ->  Seq Scan on entries e  (cost=0.00..10.70 rows=70 width=538) (actual time=0.016..0.019 rows=22 loops=1)
      ->  Hash  (cost=12.62..12.62 rows=70 width=1072) (actual time=0.029..0.029 rows=4 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 9kB
        ->  Subquery Scan on t  (cost=11.23..12.62 rows=70 width=1072) (actual time=0.021..0.023 rows=4 loops=1)
        ->  HashAggregate  (cost=11.23..11.92 rows=70 width=524) (actual time=0.017..0.018 rows=4 loops=1)
          Group Key: entries_1.c1, entries_1.c2
          ->  Seq Scan on entries entries_1  (cost=0.00..10.70 rows=70 width=524) (actual time=0.004..0.005 rows=22 loops=1)
  ->  Index Scan using entries_pkey on entries  (cost=0.14..0.85 rows=1 width=1054) (actual time=0.001..0.002 rows=1 loops=7)
    Index Cond: (id = e.id)
Planning time: 0.293 ms
Execution time: 0.219 ms

Which one is better?

If you compare both Execution Plans, you can see that the Windows Function query yields a better cost than the other one.

So, not only that the query is much easier to read, but chances are that it’s going to be more efficient as well.

If you enjoyed this article, I bet you are going to love my book as well.

Conclusion

As I already explained, it’s time to break free from the SQL-92 mindset.

SQL has many features like Window Functions, Common Table Expressions, PIVOT, derived tables and set operations that you can use to find the right answer to your data processing questions. For more about new SQL features, check out Markus Winand’s Modern SQL web site.

Enter your email address to follow this blog and receive notifications of new posts by email.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s