Why you should definitely learn SQL Window Functions
Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?
Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.
So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!
Introduction
I found this question on the Hibernate forum, and it’s a very good opportunity to show why mastering Windows Functions is a very important skill for any backend software developer.
Domain Model
Let’s assume we have the following entries
table in our database:
| id | c1 | c2 | c3 | c4 | c5 | |----|------|----|----|----|----| | 1 | 2000 | a | 1 | x | 0 | | 2 | 2000 | a | 1 | y | 0 | | 3 | 2000 | a | 1 | z | 0 | | 4 | 2000 | a | 2 | z | 0 | | 5 | 2000 | a | 2 | x | 0 | | 6 | 2000 | b | 1 | x | 0 | | 7 | 2000 | b | 1 | y | 0 | | 8 | 2000 | b | 1 | z | 0 | | 9 | 2000 | b | 2 | z | 0 | | 10 | 2001 | a | 1 | x | 0 | | 11 | 2001 | a | 1 | y | 0 | | 12 | 2001 | a | 1 | z | 0 | | 13 | 2001 | a | 2 | z | 0 | | 14 | 2001 | a | 2 | x | 0 | | 15 | 2001 | a | 2 | y | 0 | | 16 | 2001 | a | 2 | w | 0 | | 17 | 2001 | a | 3 | y | 0 | | 18 | 2001 | a | 3 | w | 0 | | 19 | 2001 | b | 1 | x | 0 | | 20 | 2001 | b | 1 | y | 0 | | 21 | 2001 | b | 2 | x | 0 | | 22 | 2001 | b | 2 | z | 0 |
The problem
As stated in the question, the user’s goal is to:
I want to update column c5 (to 1) of each group base on columns c1, c2, c3 where c3 is maximum in the same c1, c2 group.
Easy peasy!
As I already explained, SQL is a Magic Wand. Not only SQL has been a driving force in RDBMS’s widespread adoption, but even NewSQL databases (Google Spanner, CockroachDB) or Data streaming frameworks, like Kafka, have adopted SQL.
Window Functions to the rescue!
So, this is how you can solve this problem:
int updateCount = entityManager.createNativeQuery(""" update entries set c5 = 1 where id in ( select id from ( select *, MAX (c3) OVER ( PARTITION BY c1, c2 ) as max_c3 from entries ) t where t.c3 = t.max_c3 ) """) .executeUpdate(); assertEquals(7, updateCount);
Because Window Functions allow you to aggregate values without breaking the returning table result set, we can easily find the matching identifiers which can be passed to the UPDATE statement.
The Execution Plan for the UPDATE statement above looks as follows:
explain analyze update entries set c5 = 1 where id in ( select id from ( select *, MAX (c3) OVER (PARTITION BY c1, c2) as max_c3 from entries ) t where t.c3 = t.max_c3 ) Update on entries (cost=15.27..23.30 rows=1 width=2134) (actual time=0.154..0.154 rows=0 loops=1) -> Nested Loop (cost=15.27..23.30 rows=1 width=2134) (actual time=0.094..0.104 rows=7 loops=1) -> HashAggregate (cost=15.12..15.13 rows=1 width=1084) (actual time=0.083..0.085 rows=7 loops=1) Group Key: t.id -> Subquery Scan on t (cost=12.85..15.12 rows=1 width=1084) (actual time=0.063..0.080 rows=7 loops=1) Filter: (t.c3 = t.max_c3) Rows Removed by Filter: 15 -> WindowAgg (cost=12.85..14.25 rows=70 width=1056) (actual time=0.053..0.065 rows=22 loops=1) -> Sort (cost=12.85..13.02 rows=70 width=1052) (actual time=0.044..0.045 rows=22 loops=1) Sort Key: entries_1.c1, entries_1.c2 Sort Method: quicksort Memory: 26kB -> Seq Scan on entries entries_1 (cost=0.00..10.70 rows=70 width=1052) (actual time=0.009..0.011 rows=22 loops=1) -> Index Scan using entries_pkey on entries (cost=0.14..8.16 rows=1 width=1054) (actual time=0.002..0.002 rows=1 loops=7) Index Cond: (id = t.id) Planning time: 0.201 ms Execution time: 0.230 ms
What if you can’t use Window Functions?
Nowadays, all major DBs support Window Functions, MySQL 8.0 being one of the last major RDBMS to join the club. Oracle, PostgreSQL, and SQL Server have been supporting Window Functions for quite some time now.
However, assuming that you are stuck with some old MySQL instance, you can still solve this problem using the following SQL query:
int updateCount = entityManager.createNativeQuery(""" update entries set c5 = 1 where id in ( select e.id from entries e inner join ( select c1, c2, max(c3) as max_c3 from entries group by c1, c2 ) t on e.c1 = t.c1 and e.c2 = t.c2 and e.c3 = t.max_c3 ) """ ) .executeUpdate(); assertEquals(7, updateCount);
The Execution Plan for the UPDATE statement above looks as follows:
explain analyze update entries set c5 = 1 where id in ( select e.id from entries e inner join ( select c1, c2, max(c3) as max_c3 from entries group by c1, c2 ) t on e.c1 = t.c1 and e.c2 = t.c2 and e.c3 = t.max_c3 ) Update on entries (cost=25.49..26.22 rows=1 width=1612) (actual time=0.112..0.112 rows=0 loops=1) -> Nested Loop (cost=25.49..26.22 rows=1 width=1612) (actual time=0.081..0.090 rows=7 loops=1) -> HashAggregate (cost=25.35..25.36 rows=1 width=562) (actual time=0.074..0.075 rows=7 loops=1) Group Key: e.id -> Hash Join (cost=13.85..25.35 rows=1 width=562) (actual time=0.067..0.070 rows=7 loops=1) Hash Cond: ((e.c1 = t.c1) AND ((e.c2)::text = (t.c2)::text) AND (e.c3 = t.max_c3)) -> Seq Scan on entries e (cost=0.00..10.70 rows=70 width=538) (actual time=0.016..0.019 rows=22 loops=1) -> Hash (cost=12.62..12.62 rows=70 width=1072) (actual time=0.029..0.029 rows=4 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 9kB -> Subquery Scan on t (cost=11.23..12.62 rows=70 width=1072) (actual time=0.021..0.023 rows=4 loops=1) -> HashAggregate (cost=11.23..11.92 rows=70 width=524) (actual time=0.017..0.018 rows=4 loops=1) Group Key: entries_1.c1, entries_1.c2 -> Seq Scan on entries entries_1 (cost=0.00..10.70 rows=70 width=524) (actual time=0.004..0.005 rows=22 loops=1) -> Index Scan using entries_pkey on entries (cost=0.14..0.85 rows=1 width=1054) (actual time=0.001..0.002 rows=1 loops=7) Index Cond: (id = e.id) Planning time: 0.293 ms Execution time: 0.219 ms
Which one is better?
If you compare both Execution Plans, you can see that the Windows Function query yields a better cost than the other one.
So, not only that the query is much easier to read, but chances are that it’s going to be more efficient as well.
I'm running an online workshop on the 11th of October about High-Performance SQL.If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.
Conclusion
As I already explained, it’s time to break free from the SQL-92 mindset.
SQL has many features like Window Functions, Common Table Expressions, PIVOT, Derived Tables and set operations that you can use to find the right answer to your data processing questions. For more about new SQL features, check out Markus Winand’s Modern SQL website.
