Index Selectivity

Last modified:

Imagine having a tool that can automatically detect JPA and Hibernate performance issues. Wouldn’t that be just awesome?

Well, Hypersistence Optimizer is that tool! And it works with Spring Boot, Spring Framework, Jakarta EE, Java EE, Quarkus, or Play Framework.

So, enjoy spending your time on the things you love rather than fixing performance issues in your production system on a Saturday night!

Introduction

In this article, we are going to see how Index Selectivity works in relational database systems and why the database Optimizer might choose to avoid using an index if the number of matching records is large.

Index selectivity is inversely proportional to the number of index entries matched by a given value. So, a unique index has the highest selectivity because only a single entry can be matched by any given value.

On the other hand, if column values are skewed, then a column value matching a large number of table records is going to have a low selectivity.

Domain Model

Let’s consider we have the following task table:

Index selectivity task table

The status column is task_status, which is a custom PosgreSQL Enum type we created like this:

CREATE TYPE task_status AS ENUM ('TO_DO', 'DONE', 'FAILED')

The status column values are unevenly distributed across the task records, so when counting the number of matching records for every task_status value:

SELECT COUNT(*) as matching_records, status
FROM task
GROUP BY status
ORDER BY 1 DESC

We can see that the records are skewed since 95% of the table records have the status value of DONE while only 1% of the table records have the value of TO_DO:

| matching_records | status |
|------------------|--------|
| 95000            | DONE   |
| 4000             | FAILED |
| 1000             | TO_DO  |

Index Selectivity

Now, let’s create a B+Tree index on the status column:

CREATE INDEX IF NOT EXISTS idx_task_status ON task (status)

When executing the following query:

EXPLAIN ANALYZE
SELECT *
FROM task
WHERE status = 'TO_DO'

PostgreSQL chooses the following execution plan:

Index Scan using idx_task_status on task  
  (cost=0.29..294.70 rows=1013 width=22) 
  (actual time=0.162..0.342 rows=1000 loops=1)
  Index Cond: (status = 'TO_DO'::task_status)

However, when running the same query against the status value of DONE, we get the following SQL execution plan:

Seq Scan on task  
  (cost=0.00..1887.00 rows=95120 width=22) 
  (actual time=0.017..11.450 rows=95000 loops=1)
  Filter: (status = 'DONE'::task_status)
  Rows Removed by Filter: 5000

As I explained in this article, relational databases use a Cost-Based Optimizer to choose the most efficient Execution Plan for a given SQL query execution.

Because the status value of DONE matches 95% of the table records, PostgreSQL choses to use a sequential scan instead of scanning the idx_task_status index.

When using the idx_task_status index, the database needs to traverse the index, locate the row identifier, and then load the table record by the record identifier in order to select the rest of the columns required by the SQL query project.

In the first example, the number of matched records is 1000, which is 1% of the entire table record. The cost of scanning the index and loading the rest of the columns from the table is lower than scanning the entire table, so PostgreSQL uses the idx_task_status index for the TO_DO predicate value.

In the second example, the predicate value of DONE has a very low index selectivity since it matches 95,000 table records. In this case, scanning the task table page by page is faster than doing 95,000 lookups from the idx_task_status index to the task table.

Partial or Filtered Indexes

When the indexed column values are skewed, it’s more efficient to use a Partial or Filtered Index, like the following one:

CREATE INDEX idx_task_status ON task (status) WHERE status <> 'DONE'

Since only the TO_DO and FAILED status values have high index selectivity, it’s better to build the index only for these values and skip the row identifiers having the status value of DONE. This will help us reduce the index size.

Now, because PostgreSQL provides only a pg_indexes_size function that gives us the size of all indexes associated to a given table, we will need to calculate the idx_task_status index size by subtracting the size of the Primary Key index.

As I explained in this article, PostgreSQL creates a default index on the Primary Key, which in our case looks as follows:

| tablename | indexname | indexdef                                                      |
|-----------|-----------|---------------------------------------------------------------|
| task      | task_pkey | CREATE UNIQUE INDEX task_pkey ON public.task USING btree (id) |

When calculating the size of all indexes on the task tables

SELECT
    pg_size_pretty(
        pg_indexes_size(relid)
    ) as "Index Size"
FROM
    pg_statio_user_tables
WHERE
    pg_statio_user_tables.relname = 'task'

We get the value of 2208 kB for the task_pkey index:

| Index | Size |
|-------|------|
| 2208  | kB   |

When creating the Partial Index that contains only the TO_DO and FAILED status values, the overall task table index size is:

| Index | Size |
|-------|------|
| 2264  | kB   |

So, the size of the Partial Index is just 56 kB.

However, if we create the Full Index that includes all values of the status column, the overall task table index size is:

| Index | Size |
|-------|------|
| 2904  | kB   |

So, the Full Index has a size of 696 kB and no advantage of the Partial Index since the index selectivity of the value of DONE is too low to be considered by our query predicate.

I'm running an online workshop on the 20-21 and 23-24 of November about High-Performance Java Persistence.

If you enjoyed this article, I bet you are going to love my Book and Video Courses as well.

Conclusion

When running an SQL query, an index can help us speed up the query execution if the selectivity is high. Otherwise, the database might choose to avoid using an index.

If you have columns with skewed data, using a Partial Index can help you save space, therefore allowing you to store more index and table records in the Buffer Pool.

Follow @vlad_mihalcea