Saturday, March 05, 2016

Dan Martensen's blog: SQL Performance of Join and Where Exists

Here's a neat little article on the blog of a fellow named Dan Martensen. I really enjoy performance experiments that show how one approach fares vs. another - it's something I do myself as well. :) This one shows how "where exists" performs vs. an inner join in Postgres 9.5.

Here is the article.

It turns out "where exists" is faster, because as the explain plan reveals, a join performs a Merge Join step, while "where exists" can get away with a Merge Semi Join step. What do these steps mean? Let's turn to the documentation (Source):
  • merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key. 
As for Merge Semi Join, I can't find such a clear definition in the docs, but Dan Martensen's article (that I linked to above) mentions: "It’s worth noting Where Exists doesn’t need to return a value to the parent query in the SELECT. It’s WHERE clause compares parent to subquery ids using a boolean short circuit early exit."

So maybe that's a clue as for what goes on under the hood -- it just checks for the matching rows' existence then quits. No data retrieval required.

Perhaps I will soon check if MySQL behaves similarly, as my work will involve more MySQL soon!

No comments: