Saturday, July 19, 2025

Playing SQL Mythbuster - or - I really ought to lay off the SQL on social media

Often, when I visit SQL-related forums and social media, I encounter posts that are half-truths, overgeneralizations, and sometimes, outright myths. This is especially true on the subject of SQL performance. 

The problem is I can't just scroll past. I have this compulsion to play SQL Mythbuster. If the posts had, say, 2 likes and that's it, I'd probably move on. But they often have hundreds of positive reactions and comments in response to some dubious, unsupported, or at least badly qualified claims. 


 

For example, here's a recent LinkedIn post:

[screenshot of the LinkedIn post]

Here's why I'm not a fan of this post (no offense to the author):

  1. The proposed optimization is a magic performance trick. It is essentially saying, "Hey, instead of writing this straightforward query, you can rearrange it in this magic way to make it go fast!" The post doesn't attempt to justify the claim other than by unconvincingly stating that the more straightforward query (hereafter called Query 1) causes "poor index utilization," while the supposedly optimized query (hereafter called Query 2) presumably does not.

  2. The post doesn't even state which of the 100+ DBMSs* in the world the author is running, each of which has unique performance characteristics. Always be skeptical of claims resembling, "In SQL, always do this thing to make it fast," without qualifying the claim with a specific DBMS. It is like saying, "In vehicles with an internal combustion engine, always do this to make them run fast." For starters, what kind of vehicle: car? truck? plane?

    It is worth mentioning, despite my skepticism, that it's quite possible this trick did work for some DBMS. But which one? Tell us! My intuition tells me there's no way the claim is universally true, but it could be accurate for some specific version of some specific DBMS (with a specific schema, specific data set, specific configuration, etc.). The problem I have is that it's presented as if the trick applies to all of "SQL." (Whatever that is; SQL doesn't exist in a vacuum, remember!) To be fair, it didn't explicitly say "all of SQL," but IMHO, it is implied.

    Also worth mentioning, a commenter asked which DBMS, but the original poster didn't reply.

    *I am using "DBMS" as a generic, catch-all term for a relational DBMS or any other SQL engine.

  3. There is no evidence presented to convince the reader that Query 2 is faster: no timing data; no execution plans. There is no demonstration of the "poor index utilization" of Query 1. The poster just said it's real and left it at that. 🤷 One commenter did ask to see some evidence, and the poster pushed back. 🤷🤷

  4. Although we can make some reasonable assumptions - e.g. id is probably the sole primary key column, and it's an integer type - nothing is mentioned about the table definition, the number of rows, the width of the rows, or the table's physical structure (e.g. is it a clustered index? does it have secondary indexes? and so on).
     
  5. The final, and perhaps most important, reason I'm not a fan of the post is this:

     [screenshot of the post's reaction and repost counts]
    Great (/s) - so 926 reactions, most of which are positive. In other words, a lot of folks accepting this performance trick as factual. Also notice the 51 reposts. Way to spread yet another new myth. 🤦🏻‍♂️

That said, is it actually a myth? Could it be that, even though the post isn't well written, Query 2 does actually run faster than Query 1 on some DBMSs?

When in doubt, test it out!

But here's the thing - I can't test it on all 100+ SQL DBMSs in the world. Who can? (Maybe somebody with a really cool test suite set up?) What I can do is test it on a recent version of an extremely popular DBMS.

Below is a test case on MySQL 8.0, using reasonable assumptions, that demonstrates Query 2 is not faster, which disproves the claim...at least for my specific test on MySQL 8.0. The test below, of course, does not prove anything about Oracle, Snowflake, SQL Server, or any other DBMS.

But all it really takes to disprove a universal claim is a single counterexample, and MySQL 8.0 isn't exactly an obscure DBMS. So I consider the claim to be a myth. ("BUSTED," if you want. :)) 

I should put my money where my mouth is and show my tests. So, here they are! 

MySQL 8.0 Tests: 

mysql> /* setup */
mysql> CREATE TABLE users (id INT AUTO_INCREMENT PRIMARY KEY, name VARCHAR(100), salary INT);
Query OK, 0 rows affected (0.01 sec)


mysql> /* generate 10 million dummy rows */
mysql> INSERT INTO users (name, salary)
    -> SELECT RAND(), RAND()*10000 FROM information_schema.columns c1, information_schema.columns c2, information_schema.columns c3
    -> LIMIT 10000000;
Query OK, 10000000 rows affected (1 min 39.74 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

For the purpose of this demonstration, I am searching by 10,000 IDs in a 10-million-row table. If I search by a mere five IDs, both queries execute so fast that the MySQL CLI (my client) reports both as running in 0.00 seconds flat. So I needed to scale up the search to the point where there can be a measurable difference.
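
In case you're wondering how one might build an IN list with 10,000 ids, here's a rough sketch of one way to pull 10,000 random existing ids out of the table (purely illustrative; not necessarily how I built mine):

/* group_concat_max_len defaults to 1024 bytes, which would truncate a list this long */
SET SESSION group_concat_max_len = 1000000;

/* grab 10,000 random existing ids and concatenate them for pasting into IN (...) */
SELECT GROUP_CONCAT(id) AS id_list
FROM (SELECT id FROM users ORDER BY RAND() LIMIT 10000) AS sample;

/* note: ORDER BY RAND() over 10 million rows is itself slow; fine for one-off test prep */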

Query 1:

mysql> EXPLAIN ANALYZE
    -> SELECT * FROM users WHERE id IN (9462823,6458594,3326348, <snip>); 

Query 1 Average Time to Execute (6 trials): 122 ms 

Trial details:
  • Trial 1: 200 ms
  • Trial 2: 160 ms
  • Trial 3: 140 ms
  • Trial 4: 80 ms
  • Trial 5: 90 ms
  • Trial 6: 60 ms
  • (MySQL CLI used, which rounds to hundredths of a second)

 

Query 1 Sample Execution Plan:

-> Filter: (users.id in (9462823,6458594,3326348,<snip>))  (cost=10122 rows=9994) (actual time=0.0575..43.9 rows=10000 loops=1)
    -> Index range scan on users using PRIMARY over (id = 741) OR (id = 1245) OR (9998 more)  (cost=4502 rows=10000) (actual time=0.0525..41.9 rows=10000 loops=1)

 
Note that MySQL used an "Index range scan on users using PRIMARY." That means it looked up the whole batch of id values against the primary key index in a single range scan, which is about as good as you can hope to do. Absolutely not "poor index utilization."
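
For contrast, here is a hypothetical example (not part of my timed trials) of what genuinely poor index utilization would look like on this table: filter on a column with no index at all, and the only option is a full table scan.

/* salary has no index in my test table, so MySQL has to scan all 10 million rows */
EXPLAIN SELECT * FROM users WHERE salary IN (100, 200, 300);
/* expect a plan showing a full table scan on users rather than any index access */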

Query 2:

mysql> WITH ids(id) AS (
    ->     /* should use row constructors IMO, but this is how the OP did it */
    ->     SELECT 9462823 UNION ALL
    ->     SELECT 6458594 UNION ALL
    ->     SELECT 3326348 UNION ALL
    ->     <snip>
    -> )
    -> SELECT u.*
    -> FROM users u
    -> JOIN ids ON u.id = ids.id;


Query 2 Average Time to Execute (6 trials run): 197 ms 

Trial details:
  • Trial 1: 240 ms
  • Trial 2: 230 ms
  • Trial 3: 220 ms
  • Trial 4: 200 ms
  • Trial 5: 150 ms
  • Trial 6: 140 ms
  • (MySQL CLI used, which rounds to hundredths of a second)

 

Query 2 Sample Execution Plan:

-> Nested loop inner join  (cost=11516 rows=10000) (actual time=7.88..70.8 rows=9955 loops=1)
    -> Table scan on ids  (cost=2304..2432 rows=10000) (actual time=7.71..8.19 rows=10000 loops=1)
        -> Materialize union CTE ids  (cost=2304..2304 rows=10000) (actual time=7.71..7.71 rows=10000 loops=1)
            -> Rows fetched before execution  (cost=0..0 rows=1) (actual time=166e-6..208e-6 rows=1 loops=1)
            -> Rows fetched before execution  (cost=0..0 rows=1) (actual time=167e-6..209e-6 rows=1 loops=1)
            -> Rows fetched before execution  (cost=0..0 rows=1) (actual time=42e-6..84e-6 rows=1 loops=1)
            <snip>
            -> Rows fetched before execution  (cost=0..0 rows=1) (actual time=0..41e-6 rows=1 loops=1)
    -> Filter: (u.id = ids.id)  (cost=0.25 rows=1) (actual time=0.00218..0.00226 rows=1 loops=10000)
        -> Single-row index lookup on u using PRIMARY (id = ids.id)  (cost=0.25 rows=1) (actual time=0.00207..0.0021 rows=1 loops=10000)

What this plan tells us in plain English:

  1. Materialize the CTE ids (think of it as building a temporary table) from all those UNION ALL'ed constants. Each of those tiny SELECTs of a constant shows up in the plan as a "Rows fetched before execution" step.
  2. Table scan ids and, for each id, do a single-row index lookup on the primary key, all inside a Nested loop inner join. In other words: for each id in ids, fetch that one row by primary key. (See the sketch just below this list.)
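
If it helps, here is roughly the same work spelled out as explicit steps. This is only an illustration of what the optimizer is doing internally, not something I ran or timed:

/* step 1: materialize the constants into a temp table (the CTE materialization) */
CREATE TEMPORARY TABLE ids (id INT);
INSERT INTO ids VALUES (9462823), (6458594), (3326348) /* , ...the other 9,997 ids */;

/* step 2: scan the temp table and, for each id, do one primary-key lookup (the nested loop join) */
SELECT u.*
FROM users u
JOIN ids ON u.id = ids.id;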

Conclusion: 

All in all, Query 2 didn't perform badly (avg 197 ms), but it was definitely worse than Query 1 (avg 122 ms); Query 1 took about 38% less time, which is pretty significant. (I actually wonder: if I replaced all those UNION ALLs in Query 2 with row constructors, would that help the performance? I'm not going to benchmark that today, but see the sketch below for what I mean.)
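
Here is that untested rewrite of Query 2 using row constructors. Note it requires MySQL 8.0.19 or later, where the VALUES ROW(...) table value constructor is supported:

/* same as Query 2, but the id list is built with row constructors instead of 10,000 UNION ALLs */
WITH ids(id) AS (
    VALUES ROW(9462823), ROW(6458594), ROW(3326348) /* , ...the other 9,997 ids */
)
SELECT u.*
FROM users u
JOIN ids ON u.id = ids.id;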

Aside from performance claims, the post also claims Query 2 provides:

  • "single query execution": sure it's one query, but so is Query 1. Huh???
  • Readability: subjective, but I'm not seeing it in this instance, and I say this as a CTE fanboy. (Query 2 uses a CTE, i.e. the WITH clause.)
  • Portability: Either of these queries should successfully parse and execute on pretty much any modern SQL engine, but if you came up to me and said, "I know a SQL DBMS that can only handle one of these two queries -- which is it?" Dollars to doughnuts it would be Query 1. (Remember, MySQL didn't even support CTEs until MySQL 8.0!)

OK, end of rant for today! But yeah, sometimes I need to stay away from SQL chatter on the internet. There's all kinds of this stuff that drives me crazy. 

Edit Weeks Later: Now Let's Test PostgreSQL 17.4! 

I thought I might enhance this post by repeating the same MySQL tests, but on Postgres 17.4 instead! See below.
 
postgres=# CREATE TABLE users (id INT GENERATED BY DEFAULT AS IDENTITY PRIMARY KEY, name VARCHAR(100), salary INT);
CREATE TABLE

postgres=# INSERT INTO users (name, salary)
postgres-# SELECT RANDOM(), RANDOM()*10000
postgres-# FROM generate_series(1, 10000000);
INSERT 0 10000000
Time: 38705.863 ms (00:38.706)
 

Query 1, Postgres:

postgres=# EXPLAIN ANALYZE
postgres-# SELECT * FROM users WHERE id IN (1322614, 8993900, 6548546, <snip>);
 

Query 1, Postgres, Average Time to Execute (6 trials): 103 ms 

Trial details:
  • Trial 1: 11.5 ms (planning) + 99.0 ms (execution) = 110.5 ms
  • Trial 2: 7.9 ms (planning) + 156.6 ms (execution) = 164.5 ms
  • Trial 3: 8.6 ms (planning) + 60.7 ms (execution) = 69.3 ms
  • Trial 4: 11.1 ms (planning) + 92.5 ms (execution) = 103.6 ms
  • Trial 5: 6.4 ms (planning) + 70.4 ms (execution) = 76.8 ms
  • Trial 6: 11.6 ms (planning) + 88.1 ms (execution) = 99.7 ms

Query 1, Postgres, Sample Execution Plan:

 Index Scan using users_pkey on users  (cost=0.43..35565.89 rows=10000 width=27) (actual time=3.200..88.645 rows=10000 loops=1)
   Index Cond: (id = ANY ('{1322614,8993900,6548546,3547394,9242769,4281742,4334656, <snip>}'::integer[]))

Query 2, Postgres:

postgres=# EXPLAIN ANALYZE
postgres-# WITH ids(id) AS (
postgres(#     SELECT 1322614 UNION ALL
postgres(#     SELECT 8993900 UNION ALL
postgres(#     SELECT 6548546 UNION ALL
postgres(# <snip>
postgres(# )
postgres-# SELECT u.*
postgres-# FROM users u
postgres-# JOIN ids ON u.id = ids.id;


Query 2, Postgres, Average Time to Execute  (6 trials): 415.5 ms 

Trial details:
  • Trial 1: 170.6 ms (planning) + 244.0 ms (execution) = 414.6 ms
  • Trial 2: 180.9 ms (planning) + 284.4 ms (execution) = 465.3 ms
  • Trial 3: 169.9 ms (planning) + 220.1 ms (execution) = 389.1 ms
  • Trial 4: 168.5 ms (planning) + 214.2 ms (execution) = 382.7 ms
  • Trial 5: 176.6 ms (planning) + 268.1 ms (execution) = 444.7 ms
  • Trial 6: 174.8 ms (planning) + 219.3 ms (execution) = 394.1 ms

Query 2, Postgres, Sample Execution Plan:

 Gather  (cost=1000.43..2057.59 rows=10000 width=27) (actual time=25.238..216.174 rows=10000 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Nested Loop  (cost=0.43..57.59 rows=4167 width=27) (actual time=0.052..87.485 rows=3333 loops=3)
         ->  Parallel Append  (cost=0.00..50.00 rows=1 width=4) (actual time=0.002..3.282 rows=3333 loops=3)
               ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
               ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
               ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.001..0.001 rows=1 loops=1)
               <snip>
         ->  Index Scan using users_pkey on users u  (cost=0.43..7.58 rows=1 width=27) (actual time=0.025..0.025 rows=1 loops=10000)
               Index Cond: (id = (1322614))



⚠️ On Postgres, my first attempt at running Query 2 failed because:

ERROR:  stack depth limit exceeded
HINT:  Increase the configuration parameter "max_stack_depth" (currently 2048kB), after ensuring the platform's stack depth limit is adequate.


This is an issue with stringing together all those single-literal SELECTs separated by UNION ALLs, which nobody should be doing in 2025 unless their DBMS of choice doesn't support row constructors (Postgres does). My workaround for the purposes of running these tests was to add this to the top of my script:

set max_stack_depth = '7664kB';
 
7664kB is the maximum max_stack_depth Postgres would allow on my machine, and it isn't that much larger than the default of 2048kB. I got lucky in this particular use case, but if that's the ceiling, the Query 2 approach appears to be much less scalable on Postgres.
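
For what it's worth, here is how the same id list could be written with row constructors on Postgres. It's a sketch I didn't benchmark for this post, but since there's no deep UNION ALL nesting, it shouldn't run into the stack depth limit at all:

WITH ids(id) AS (
    VALUES (1322614), (8993900), (6548546) /* , ...the other 9,997 ids */
)
SELECT u.*
FROM users u
JOIN ids ON u.id = ids.id;

-- or skip the CTE entirely and pass an array:
-- SELECT * FROM users WHERE id = ANY (ARRAY[1322614, 8993900, 6548546 /* , ... */]);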

 

Conclusion for Postgres: 

So in summary, this magic performance trick was also BUSTED on Postgres 17.4. Query 2 is actually much slower than Query 1 in this test; it took almost exactly four times as long to run!
 
Notice the Postgres planner used parallel processing for Query 2, which my Postgres instance currently caps at 2 workers. It's possible that if I tuned the parallel query settings, it could do better than shown here. It's not impossible that it could even beat Query 1, but I have to go with the results I have. This is why "for this particular configuration" is an important qualifier for your tests, in addition to DBMS, version, etc.
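
For reference, tuning the parallel query settings would involve knobs like these (illustrative values only; I did not re-run the tests with them):

-- allow more workers per Gather node (my instance was capped at 2)
SET max_parallel_workers_per_gather = 4;

-- overall pool of parallel workers available to the instance
SET max_parallel_workers = 8;

-- then re-run EXPLAIN ANALYZE on Query 2 and compare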
 
It may even be worth mentioning that all the tests in this post were run on my 2021 M1 MacBook Pro. The results could vary if I ran them on another machine, or on a cloud database service such as Amazon RDS.
 
 

Overall Conclusion: 

The major point I'm trying to make here is not "they are wrong and I am right," but rather that it is important to communicate about SQL performance more carefully! When making a claim about a performance quirk, always qualify it with the specific DBMS. Ideally, mention the version and give more detail about the schema and data as well.
 
Always show the proof that one approach was slower or faster than the other. Show the respective timings, and ideally, the respective execution plans.  

Always be skeptical of these magic performance tricks you see on the internet. If they can't even specify DBMS, do not believe them! At least not by default. Make them put up or shut up. :) If they present the full test case and it supports their claim, great! But even then, the test shouldn't be generalized to all of "SQL." Rather, the test demonstrates something about a specific data set in a specific schema on a specific version of a specific DBMS product with a specific configuration, etc.
 
Question to the readers: what do you guys think? Should I test the claim on more DBMSs? Am I being fair? Do the tests seem valid? Let me know what you think in the comments!
 
 
