Optimizing Joins with CTE

This week, I faced a problem with the performance of a query in PostgreSQL. Let’s give a quick context : 3 tables, 1000 rows on the first one, 50.000 rows on the second one and 5.000.000 rows on the last one. The usage of CTE (Common Table Expressions). And an OUTER JOIN.

I came from a cardinality result of more than 70.000.000 rows with a time response of 3 minutes to only 10 seconds.

Photo by Pixabay on Pexels.com

The query

At the beginning of the project, we’ve written a complex query to satisfy the following need: return all the items which are not impacted by the constraints.

There were 3 tables: the constraint definition, the contraints details and the items.

Constraints table
id (primary key)
active (boolean)
main constraint (varchar not null)
Constraints details table
id (primary key)
constraint id (foreign key)
index (integer not null)
inner constraint 1 (varchar not null)
inner constraint 2 (varchar not null)
Items table
id (primary key)
main constraint (varchar not null)
inner constraint 1 (varchar not null)
inner constraint 2 (varchar not null)

So, to obtain all the items not impacted by the constraints, we’ve implemented a query like that:

WITH cte_c AS (SELECT * 
    FROM constraints c
        JOIN constraints_details cd ON c.id = cd.constraint_id)
SELECT i.*
FROM items i
    LEFT OUTER JOIN c ON i.main_constraint = cte_c.main_constraint
        AND i.inner_constraint_1 = cte_c.inner_constraint_1
        AND i.inner_constraint_2 = cte_c.inner_constraint_2
WHERE cte_c.id IS NULL;

Let’s take a look to the query.

CTE (Common Table Expressions)

The first thing we notice is the CTE getting the join of the two tables on the constraints. This is done because the later Outer Join won’t work as the join condition is done against the two tables of contraints.

Left Outer Join

Then comes the Outer Join between the CTE and the Items table. An Outer Join will give me the items on the left side of the join even if there is no match in the join condition.

Inner join vs Outer join
Inner join vs Outer join

Finally, in the Where clause, I exclude the rows where there is some data from the right side, from the CTE. This will only leave the data from the Items table without any match in the constraints table.

First Optimization

Let’s talk about the performance now. At the beginning, all 3 tables were relatively small: 100 constraints, each one with 1 or 2 details and 10.000 items. These leads to a result with 2.000 results.

The query run in less than 2 seconds. But then, the constraints table started to increase significantly (from 100 constraints to 1.000 constraints). The same for the Items table (from 10.000 items to 500.000 items). This lead to an increase in the response time of the query, from 2 seconds to 60 seconds.

The first optimization I’ve made, was to exclude as much constraints possible (with the active column and others).

WITH cte_c AS (SELECT * 
    FROM constraints c
        JOIN constraints_details cd ON c.id = cd.constraint_id
    WHERE c.active)

The gain was not enough. The query was used in a web request and we had to increase the timeout of the load balancer from 60 seconds to 3 minutes (bad solution). This left me more time to investigate another solution.

Explain Analyze

As expected, soon the query became slower as the tables continue to increase their size. From 1.000 contraints to 50.000. And from 500.000 items to 5.000.000. And of course, the response time of the query reached the 3 minutes of the timeout of the load balancer.

I decided to use EXPLAIN ANALYZE to identify where to place an Index.

                               QUERY PLAN                                                                                                         
-----------------------------------------------------------------------
 Nested Loop Left Join (actual time=8.848..1984.196 rows=21392 loops=1)
   Join Filter: (i.main_constraint = cte_c.main_constraint AND i.inner_constraint_1 = cte_c.inner_constraint_1 AND i.inner_constraint_2 = cte_c.inner_constraint_2)
   Rows Removed by Join Filter: 6379172
   Filter: (item.id IS NULL)
   Rows Removed by Filter: 38447
   ->  Materialize (actual time=0.000..0.011 rows=221 loops=29039)
         ->  Nested Loop (actual time=6.390..8.455 rows=221 loops=1)
               ->  Seq Scan on constraint (actual time=6.358..7.708 rows=214 loops=1)
                     Filter: (constraint.active)
                     Rows Removed by Filter: 32981
               ->  Index Scan using constraint_constraint_detail_id on constraint_detail_id cd (actual time=0.003..0.003 rows=1 loops=214)
                     Index Cond: (constraint.id = constraint_detail.constraint_id)
 Planning Time: 3.547 ms
 Execution Time: 9285.382 ms

The first index to place was in the constraint details table, with the foreign key of the constraints table. Nevertheless, this doesn’t decrease the execution time below the 60 seconds. So, I had to continue trying to optimize the query.

The next most costly part of the query is the join between the CTE and the Items table.

I’ve tried adding an index with multiple columns in the items table (main_constraint, inner_constraint_1 and inner_constraint_2) which are the three columns used in the join clause. But after running the query multiple times after with the index, the time remains constant.

It seems that the index was not used in the join clause.

Second Optimization

How could I optimize the Join between a big table (5.000.000 rows) and a CTE? Then, I came into this article which explains different types of joins. In my case, as I can see in the Explain Analyze, I have a Nested Join between the CTE and the items table. How do I optimize this kind of join?

Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won’t be executed too often

The article refers to the outer relation as the table in the From clause and the inner relation as the table in the Join clause. This means that I must reverse my Join, having the smaller table (the CTE with the constraints) in the From clause and the bigger table (the items) in the Join clause.

My contribution was to create another CTE with the Items table and join both CTEs in the correct way.

WITH 
	cte_c AS (SELECT * 
    	FROM constraints c
        	JOIN constraints_details cd ON c.id = cd.constraint_id),
	cte_i AS (SELECT i.*
		FROM items i)
SELECT cte_i.*
FROM cte_c
	RIGHT OUTER JOIN cte_i ON cte_i.main_constraint = cte_c.main_constraint
        AND cte_i.inner_constraint_1 = cte_c.inner_constraint_1
        AND cte_i.inner_constraint_2 = cte_c.inner_constraint_2
WHERE cte_c.id IS NULL;

This way, I have the constraints tables in the From clause (which are the smaller tables) and the Items table in the Join clause (which is the bigger table). And the execution time? From 3 minutes to 10 seconds.

References

My New ebook, How to Master Git With 20 Commands, is available now.

Leave a comment

A WordPress.com Website.