Did you know -
When you use a CTE in MySQL, should the optimizer choose to materialize it, it may automatically create an index on the fly, if it thinks your query could benefit from it?
Let's dig a bit into this.
First a little context, in case anyone is not aware, the materialization approach to CTE processing is when MySQL chooses to execute the CTE query, and store its results in an internal temporary table, to be reused in the parent query. This tends to be a useful approach when the CTE is referenced multiple times.
The alternate approach is to merge the CTE into the parent query. I won't discuss the merge approach for now.
For example, say you ran the following query:
WITH data_cte AS (
/* expensive query */
SELECT group_id, AVG(x) AS avg_x, MAX(x) AS max_x FROM table_100m GROUP BY group_id
)
/* "parent" part of the query; notice multiple references to data_cte
SELECT group_id, avg_x FROM data_cte WHERE max_x = 100000 -- search by max_x
UNION ALL
SELECT group_id, avg_x FROM data_cte WHERE max_x = 99999
UNION ALL
SELECT group_id, avg_x FROM data_cte WHERE max_x = 99998
;
For this query, MySQL's optimizer may be able to materialize the CTE. In addition to that, because the parent query searches by max_x three times, the optimizer may decide, "hey, I'd better create an index on max_x in the materialization."
The execution plan shows that:
1. The CTE called data_cte was indeed materialized
...and...
2. For all three lookups by max_x in the parent query, we can see in the plan an index lookup occurred. e.g. Index lookup on avg_data_cte using <auto_key0> (max_x = 99998). MySQL's optimizer automatically created an index to assist in these searches!
Execution plan:
| -> Append (cost=10.5 rows=30) (actual time=16912..16912 rows=2115 loops=1)
-> Stream results (cost=3.5 rows=10) (actual time=16912..16912 rows=419 loops=1)
-> Filter: (data_cte.max_x = 100000) (cost=3.5 rows=10) (actual time=16912..16912 rows=419 loops=1)
-> Index lookup on data_cte using <auto_key0> (max_x = 100000) (cost=0.35..3.5 rows=10) (actual time=16912..16912 rows=419 loops=1)
-> Materialize CTE data_cte if needed (cost=0..0 rows=0) (actual time=16912..16912 rows=10000 loops=1)
-> Table scan on <temporary> (actual time=16909..16909 rows=10000 loops=1)
-> Aggregate using temporary table (actual time=16909..16909 rows=10000 loops=1)
-> Table scan on table_100m (cost=10.2e+6 rows=99.8e+6) (actual time=2.2..8625 rows=100e+6 loops=1)
-> Stream results (cost=3.5 rows=10) (actual time=0.0248..0.117 rows=786 loops=1)
-> Filter: (data_cte.max_x = 99999) (cost=3.5 rows=10) (actual time=0.014..0.0798 rows=786 loops=1)
-> Index lookup on data_cte using <auto_key0> (max_x = 99999) (cost=0.35..3.5 rows=10) (actual time=0.0129..0.0546 rows=786 loops=1)
-> Materialize CTE data_cte if needed (query plan printed elsewhere) (cost=0..0 rows=0) (never executed)
-> Stream results (cost=3.5 rows=10) (actual time=0.00196..0.108 rows=910 loops=1)
-> Filter: (data_cte.max_x = 99998) (cost=3.5 rows=10) (actual time=0.00183..0.0774 rows=910 loops=1)
-> Index lookup on data_cte using <auto_key0> (max_x = 99998) (cost=0.35..3.5 rows=10) (actual time=0.00158..0.0493 rows=910 loops=1)
-> Materialize CTE data_cte if needed (query plan printed elsewhere) (cost=0..0 rows=0) (never executed)
Notice the three instances of Index lookup on data_cte using <auto_key0> - that's the automagic index goodness in action.
There's some meta-context here. There's a recurring debate in internet SQL discussions about what is better: CTEs or temp tables. In my view, the two apply to only partially overlapping use cases, but the idea is you could rewrite the above query by splitting it into two statements, the first of which substitutes a temp table for the CTE, like so:
CREATE TEMPORARY TABLE data_tmp AS
SELECT group_id, AVG(x) AS avg_x, MAX(x) AS max_x FROM table_100m GROUP BY group_id;
Then query the temp table, like so:
SELECT group_id, avg_x FROM data_tmp WHERE max_x = 100000 -- search by max_x
UNION ALL
SELECT group_id, avg_x FROM data_tmp WHERE max_x = 99999
UNION ALL
SELECT group_id, avg_x FROM data_tmp WHERE max_x = 99998
;
Unfortunately, MySQL presents a blocker to this query: an unfortunate limitation that prevents me from even testing this approach because you can't reference a temporary table multiple times in one query.
ERROR 1137 (HY000): Can't reopen table: 'data_tmp'
But I could instead try creating a "normal" table that is only used temporarily, to serve this query, then drop it after.
Below is the execution plan from this approach. Observe that it did NOT use an index - instead it ran Table scan on data_tbl three times - because we didn't explicitly create an index:
| -> Append (cost=3011 rows=2989) (actual time=0.0747..15.2 rows=2115 loops=1)
-> Stream results (cost=1004 rows=996) (actual time=0.0734..6.31 rows=419 loops=1)
-> Filter: (data_tbl.max_x = 100000) (cost=1004 rows=996) (actual time=0.0675..6.21 rows=419 loops=1)
-> Table scan on data_tbl (cost=1004 rows=9963) (actual time=0.0488..5.69 rows=10000 loops=1)
-> Stream results (cost=1004 rows=996) (actual time=0.0383..4.23 rows=786 loops=1)
-> Filter: (data_tbl.max_x = 99999) (cost=1004 rows=996) (actual time=0.0304..4.15 rows=786 loops=1)
-> Table scan on data_tbl (cost=1004 rows=9963) (actual time=0.0219..3.74 rows=10000 loops=1)
-> Stream results (cost=1004 rows=996) (actual time=0.025..4.56 rows=910 loops=1)
-> Filter: (data_tbl.max_x = 99998) (cost=1004 rows=996) (actual time=0.021..4.47 rows=910 loops=1)
-> Table scan on data_tbl (cost=1004 rows=9963) (actual time=0.0122..3.99 rows=10000 loops=1)
But we could explicitly create an index. After doing so, and rerunning the query, we see a similar execution plan to the search portion of the CTE query's plan:
| -> Append (cost=277 rows=2115) (actual time=0.0567..3.73 rows=2115 loops=1)
-> Stream results (cost=63.7 rows=419) (actual time=0.056..0.928 rows=419 loops=1)
-> Index lookup on data_tbl using my_idx (max_x = 100000), with index condition: (data_tbl.max_x = 100000) (cost=63.7 rows=419) (actual time=0.0504..0.88 rows=419 loops=1)
-> Stream results (cost=100 rows=786) (actual time=0.017..1.27 rows=786 loops=1)
-> Index lookup on data_tbl using my_idx (max_x = 99999), with index condition: (data_tbl.max_x = 99999) (cost=100 rows=786) (actual time=0.0151..1.19 rows=786 loops=1)
-> Stream results (cost=113 rows=910) (actual time=0.0143..1.42 rows=910 loops=1)
-> Index lookup on data_tbl using my_idx (max_x = 99998), with index condition: (data_tbl.max_x = 99998) (cost=113 rows=910) (actual time=0.012..1.32 rows=910 loops=1)
When this debate arises on the r/sql subreddit, I see temp tables getting all the upvotes over CTEs. I'm not always in agreement with the reasoning. Probably the single biggest fallacy I notice is they discuss how the CTEs vs. temp tables are processed under the hood, without specifying a DBMS. But you've heard my rants on that subject about 100 times.
One of the reasons sometimes offered is that you can index a temp table, but (supposedly) can't do that with a CTE. As we've seen just now, this is not true in MySQL. In fact one could argue the CTE is more automagic and therefore preferable.
Also consider that merging may be the better approach than materialization, which is also part of the optimizer's automagic! If you create a temp table, that's forcing a materialization, whether it's better or not. If the MySQL optimizer is doing it's job right - which it usually does - we should let it do its thing and make the right decision. To be fair, it's from from perfect.
I'd also wager that at least 80% of developers wouldn't think to make the explicit index on their temp table.
Automagic ftw, in my opinion. Unless the automagic fails, then take other measures.
Here's one example of a Reddit comment on the debate. This is just a fundamental misunderstanding. (I'd feel better about it if it didn't get piles of upvotes.)
Contradicting the Reddit commenter, and their 26 upvoters, we've observed already that the CTE materialization MySQL can perform means the CTE query is NOT executed once per reference While it's possible they are correct about *some* DBMS out there (and perhaps in some querying scenarios but not all), every one of the several other DBMSs for which I've tested this subject is also capable of materializing CTEs.
It used to be the case in Postgres that CTEs were always materialized! That isn't the best approach 100% of the time.
Funny
thing is, in MySQL, the materialized CTE literally becomes a temp table
under the hood! So it almost turns into an argument of, "My temp table
is better than your temp table!" You just don't have to manage it
yourself using the CTE approach, and again, MySQL might choose the merge
approach which may run better. Automagic ftw!
But I may be digressing a bit. I'll cut the discussion here unless anyone has any questions/comments.
No comments:
Post a Comment