Thursday, October 03, 2019

PostgreSQL 12 released today!

PostgreSQL 12 was released today, and among many cool features, we have the much-awaited removal of the CTE optimization fence! Let's take a quick, hands-on look...

Please review my old post that described how it worked in previous versions: https://mwrynn.blogspot.com/2016/04/common-table-expressions-postgres-vs.html

In short, when you used a CTE, Postgres always materialized its results (computed and stored them as an intermediate result set), then ran the outer query against that.

You can still do that if you want in Postgres 12 with the "materialized" keyword, but you're not forced to.
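If you haven't seen the new syntax yet, here's a minimal sketch (big_table and key are hypothetical names):

--Postgres 12: an optional keyword between AS and the subquery controls materialization
WITH w AS MATERIALIZED (SELECT * FROM big_table)
SELECT * FROM w WHERE key = 123;  --old behavior: w is computed and stored first

WITH w AS NOT MATERIALIZED (SELECT * FROM big_table)
SELECT * FROM w WHERE key = 123;  --the planner may fold the CTE into the outer query

(If you specify neither keyword, Postgres 12 will inline a CTE that is non-recursive, has no side effects, and is referenced only once.)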

So let's repeat the test I did 3 and a half years ago! Postgres 12 only - I won't repeat on Oracle, nor will I make fun of MySQL this time.

Here's what I've got:


mwrynn=# EXPLAIN ANALYZE
WITH peeps_by_state AS NOT MATERIALIZED
  (SELECT COUNT(*) AS cnt, state FROM person_location GROUP BY state)
SELECT cnt
FROM peeps_by_state
WHERE state=13;
                                                                         QUERY PLAN                                                                         
------------------------------------------------------------------------------------------------------------------------------------------------------------
 Subquery Scan on peeps_by_state  (cost=1000.00..484733.72 rows=50 width=8) (actual time=3687.870..3687.871 rows=1 loops=1)
   ->  Finalize GroupAggregate  (cost=1000.00..484733.22 rows=50 width=12) (actual time=3687.869..3687.869 rows=1 loops=1)
         Group Key: person_location.state
         ->  Gather  (cost=1000.00..484732.22 rows=100 width=12) (actual time=3687.732..3694.247 rows=3 loops=1)
               Workers Planned: 2
               Workers Launched: 2
               ->  Partial GroupAggregate  (cost=0.00..483722.22 rows=50 width=12) (actual time=3682.099..3682.099 rows=1 loops=3)
                     Group Key: person_location.state
                     ->  Parallel Seq Scan on person_location  (cost=0.00..481655.75 rows=413195 width=4) (actual time=0.448..3600.027 rows=333333 loops=3)
                           Filter: (state = 13)
                           Rows Removed by Filter: 16333333
 Planning Time: 0.159 ms
 Execution Time: 3694.318 ms
(13 rows)

Now let's use the "MATERIALIZED" keyword and compare.


mwrynn=# EXPLAIN ANALYZE
WITH peeps_by_state AS MATERIALIZED
  (SELECT COUNT(*) AS cnt, state FROM person_location GROUP BY state)
SELECT cnt
FROM peeps_by_state
WHERE state=13;
                                                                               QUERY PLAN                                                                               
------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on peeps_by_state  (cost=534753.70..534754.83 rows=1 width=8) (actual time=11828.277..11828.392 rows=1 loops=1)
   Filter: (state = 13)
   Rows Removed by Filter: 49
   CTE peeps_by_state
     ->  Finalize GroupAggregate  (cost=534741.03..534753.70 rows=50 width=12) (actual time=11828.191..11828.328 rows=50 loops=1)
           Group Key: person_location.state
           ->  Gather Merge  (cost=534741.03..534752.70 rows=100 width=12) (actual time=11828.181..11835.081 rows=150 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Sort  (cost=533741.01..533741.14 rows=50 width=12) (actual time=11818.697..11818.703 rows=50 loops=3)
                       Sort Key: person_location.state
                       Sort Method: quicksort  Memory: 27kB
                       Worker 0:  Sort Method: quicksort  Memory: 27kB
                       Worker 1:  Sort Method: quicksort  Memory: 27kB
                       ->  Partial HashAggregate  (cost=533739.10..533739.60 rows=50 width=12) (actual time=11818.610..11818.624 rows=50 loops=3)
                             Group Key: person_location.state
                             ->  Parallel Seq Scan on person_location  (cost=0.00..429572.40 rows=20833340 width=4) (actual time=0.521..3374.912 rows=16666667 loops=3)
 Planning Time: 0.300 ms
 Execution Time: 11835.398 ms
(19 rows)

So there we have it - a significant difference! (Note that the times here cannot be compared to the times in the previous post, as it's a different server, different config, etc.)

I'll end the post here, but I may go over the steps in this execution plan and try to explain what's going on - maybe in a subsequent post or maybe I'll just edit this one.

Thursday, June 13, 2019

Clusters, Clusters Everywhere!



Clusters: what an overloaded term in the database world! Whenever someone talks to me about a database cluster, or a clustered index or clustered table, I have to stop and think for a moment to figure out what they mean. I am going to attempt to list, in no particular (or logical) order, every context I can think of in which the term "cluster" or "clustered" applies:

PostgreSQL Cluster: In Postgres, a cluster is simply your database storage area on disk. It is a collection of databases. The hierarchy is that a cluster has many databases that have many schemas that have many relations and other objects. A Postgres Cluster is more or less synonymous with a specific installation of Postgres.

Catalog Cluster: The SQL standard definition. Essentially the same as the PostgreSQL Cluster but generic. (I hadn’t actually heard of this one before except in the Postgres documentation in which they define a Postgres Cluster.)

Clustered Index and Non-clustered Index (MS SQL Server): In SQL Server, a clustered index refers to the table itself being physically shaped like an index. The index stores the entire row, sorted by key, in this index structure. A non-clustered index in Microsoft Land contains the key values in the index, but the rest of the row resides elsewhere, in a heap or clustered table (see below).

MySQL (InnoDB) has something much like this too, but there you don't get the choice of a heap table as an alternative. In Oracle, these are called Index-Organized Tables, and heap tables are the default.

Advantages: looking up by primary key is faster.

Disadvantages: inserts are slower.

There are more nuances, both pros and cons, but that's the gist of it. (One thing to consider is if your Primary Key is a UUID vs. a standard auto-increment integer, but I'll let you Google that. :))
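To make it concrete, here's a minimal T-SQL sketch (table and index names are hypothetical):

--SQL Server: a clustered index makes the table itself the index structure
CREATE TABLE person (
  id   INT NOT NULL,
  name VARCHAR(100)
);
CREATE CLUSTERED INDEX ix_person_id ON person (id);  --person is now physically stored sorted by id

--Oracle's equivalent, an index-organized table:
--CREATE TABLE person (id NUMBER PRIMARY KEY, name VARCHAR2(100)) ORGANIZATION INDEX;

(In SQL Server, declaring a PRIMARY KEY creates a clustered index by default, so you often get one without asking for it.)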
 
Clustered Table (MS SQL Server): A table that has a clustered index (above) is called a clustered table.

Clustered Index (Postgres): An index that indicates the sort order of the table on which it is built. You first create your table as usual. Then create a clustered index on the columns you want to sort by. Populate the table. Then run the CLUSTER command to sort the table. Note that unlike the Microsoft clustered index above, a Postgres clustered index does NOT store the entire row's data in the index structure.

Advantages: this may speed up queries that use table scans with the sorted columns in the where clause. Check out the top answer on this stackexchange question - it talks about looking up by a date range on such a sorted table and the performance benefits - https://dba.stackexchange.com/questions/39589/optimizing-queries-on-a-range-of-timestamps-two-columns

Some more info from the PG documentation: "In cases where you are accessing single rows randomly within a table, the actual order of the data in the table is unimportant. However, if you tend to access some data more than others, and there is an index that groups them together, you will benefit from using CLUSTER. If you are requesting a range of indexed values from a table, or a single indexed value that has multiple rows that match, CLUSTER will help because once the index identifies the table page for the first row that matches, all other rows that match are probably already on the same table page, and so you save disk accesses and speed up the query."

Disadvantages: You need to maintain your sort order by periodically re-running the CLUSTER command, and an exclusive lock is held on the table while it runs.

I once had a summary table that was completely rebuilt every night during a maintenance period, and the CLUSTER command was run afterward. So in a case like this having to run the CLUSTER command is not much of an issue.

The CLUSTER command (also Postgres): This pairs up with the Clustered Index for Postgres above. It is simply the command you issue to perform the sort - i.e. to re-order the table according to the Clustered Index.
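Here's a minimal sketch of that workflow in Postgres (reusing the person_location table from the Postgres 12 post above; the index name is made up):

--Postgres: CLUSTER physically rewrites the table in index order
CREATE INDEX person_location_state_idx ON person_location (state);
CLUSTER person_location USING person_location_state_idx;
--Later, once inserts/updates have disturbed the order, a bare
--CLUSTER re-sorts the table using the previously chosen index:
CLUSTER person_location;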

Table Clusters (Oracle): A group of tables that share common columns and store related data in the same blocks on disk. Essentially this is a way to physically colocate tables that are joined frequently, while maintaining the logical distinction between the tables. Table clusters can be used to reduce disk I/O and improve access times for joins. They can reduce storage requirements too, as the cluster key value is not stored repeatedly for each row.

You can also put just a single table in an Oracle Table Cluster - so all the data would be grouped by key in its physical home on disk. This may allow some queries to perform faster. See this old AskTom post.
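A minimal sketch in Oracle SQL, along the lines of the classic emp/dept example from the Oracle docs (all names are illustrative):

--Oracle: dept and emp rows sharing a deptno are stored in the same blocks
CREATE CLUSTER emp_dept_cluster (deptno NUMBER(2));
CREATE INDEX emp_dept_cluster_idx ON CLUSTER emp_dept_cluster;  --the cluster index (see the next entry)
CREATE TABLE dept (deptno NUMBER(2) PRIMARY KEY, dname VARCHAR2(30))
  CLUSTER emp_dept_cluster (deptno);
CREATE TABLE emp (empno NUMBER(4) PRIMARY KEY, ename VARCHAR2(30), deptno NUMBER(2))
  CLUSTER emp_dept_cluster (deptno);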

Cluster Index (Oracle): A Cluster Index is used to support a Table Cluster (above). "To locate a row in a cluster, the cluster index is used to find the cluster key value, which points to the data block associated with that cluster key value. Therefore, Oracle accesses a given row with a minimum of two I/Os--possibly more, depending on the number of levels that must be traversed in the index." -Oracle docs

Hash Clusters (Oracle): (Oh dear lord, it keeps going with Oracle, doesn't it?) I'm just going to quote Tom Kyte on this one:

Hash clusters are useful when you always access the data by primary key. For example, say you have some data in a table T and you ALWAYS query:

select * from T where id = :x;

That might be a good candidate for a hash cluster since there will be no index needed. Oracle will hash the value of :x into a physical address and go right to the data. No index range scan, just a table access by the hash value
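A minimal sketch of what that might look like (the sizing numbers are made up; in practice you'd tune SIZE and HASHKEYS to your data):

--Oracle: rows are placed by hashing the id value, so "WHERE id = :x" needs no index
CREATE CLUSTER t_hash_cluster (id NUMBER)
  SIZE 8192 HASHKEYS 1000;
CREATE TABLE T (id NUMBER, payload VARCHAR2(100))
  CLUSTER t_hash_cluster (id);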

The “common sense” definition of a Cluster: this is what I think most people are talking about when they mention a “mysql cluster” or a “database cluster." They are talking about a group of several servers, each running a database service, probably with some sort of replication set up, working in tandem to achieve load balancing and high availability. This is more a general concept than a specific feature/technology.

Real Application Clusters (aka RAC): A trademark of the Oracle Corporation, RAC is an Oracle-specific implementation of the "common sense" definition of a cluster. This feature lets you run many Oracle instances (an instance is a set of processes and memory areas) against one database in a "shared everything" architecture. The instances don't each keep a private copy of the data files; they all access a single shared database on shared storage, and Oracle coordinates their access automatically.

Confused yet?? :) Actually, for me, writing out these definitions helped me become a little less confused. Hope it helps you, too! A question for my vast readership out there -- is there any other "cluster" I missed? I am sure there are many, many implementations of the "common sense" cluster, but that's OK, we don't need to list every one. :) I also realize some of these features are supported by databases I didn't mention, such as Sybase. Again, no need to list every one.

Ok, until next time, keep clustering! (Or something...) 




Tuesday, June 11, 2019

My favorite subject...in the wild!

My friend Scott pointed out that the site boxofficemojo.com most likely does not use bind variables for its search feature - shame shame!!!

Here's an example of a "good" search. Let's look up The Godfather:


And it works! No surprises there... (Although they could probably work on their relevancy scores.)


Now let's look up the movie Breakin' (note the apostrophe):


...And click "Search"...

Doh!!! I guess the apostrophe breaks the query because they're concatenating user input directly into the SQL.

Just bind!
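If you're curious what "binding" looks like at the SQL level, here's a minimal sketch using Postgres's PREPARE/EXECUTE (the movies table is hypothetical; in an application, your language's database API would do the equivalent for you):

--Concatenating user input into the SQL string is what breaks:
--  SELECT * FROM movies WHERE title LIKE '%Breakin'%';  --syntax error!

--Binding passes the input as a value, never as SQL text:
PREPARE movie_search(text) AS
  SELECT * FROM movies WHERE title LIKE '%' || $1 || '%';
EXECUTE movie_search('Breakin''');  --at the console we still escape the quote in the literal; an app would pass the raw string

The apostrophe can't break (or inject into) the query, because it's never parsed as part of the SQL.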

Thursday, June 06, 2019

Yikes

There was a post where somebody was writing code that concatenated user inputs from a form into SQL...



:(

Helping a redditor with a SQL problem but then he vanished

A question asked on /r/sql follows. I typed up a big long answer, but he had deleted his post by the time I responded! So I'm posting it here so that all my typing doesn't go to waste!

--*--BEGIN QUESTION--*--
In lieu of our internal guru being available I was wondering if anyone would be able to help me figure out how to query the data I need.

table1 - contract data

 contractId | subTaskId | column1 | column2 | column_N
------------+-----------+---------+---------+----------
          1 |         1 | meta1   | meta2   | metaN
          1 |         2 | meta1   | meta2   | metaN
          1 |         3 | meta1   | meta2   | metaN
          1 |         4 | meta1   | meta2   | metaN
          2 |         1 | meta1   | meta2   | metaN
          2 |         2 | meta1   | meta2   | metaN
          2 |         3 | meta1   | meta2   | metaN
table2 - workflow tracking for each contract subTask

 contractId | subTaskId | taskStep | processStep | column1 | column_N
------------+-----------+----------+-------------+---------+----------
          1 |         1 |        1 | Processing  | meta1   | metaN
          1 |         1 |        2 | Review      | meta1   | metaN
          1 |         1 |        3 | Routing     | meta1   | metaN
          1 |         2 |        1 | Processing  | meta1   | metaN
          1 |         2 |        2 | Routing     | meta1   | metaN
          1 |         3 |        1 | Processing  | meta1   | metaN
          1 |         3 |        2 | Review      | meta1   | metaN
          1 |         3 |        3 | Routing     | meta1   | metaN
          1 |         3 |        4 | Final       | meta1   | metaN
          1 |         4 |        1 | Processing  | meta1   | metaN
          1 |         4 |        2 | Final       | meta1   | metaN
          2 |         1 |        1 | Final       | meta1   | metaN
          2 |         2 |        1 | Review      | meta1   | metaN
          2 |         3 |        1 | Final       | meta1   | metaN

Results

 contractId | subTaskId | processStep | table1.columns | table2.columns
------------+-----------+-------------+----------------+----------------
          1 |         1 | Routing     | table1.columns | table2.columns
          1 |         2 | Routing     | table1.columns | table2.columns
          1 |         3 | Final       | table1.columns | table2.columns
          1 |         4 | Final       | table1.columns | table2.columns
          2 |         1 | Final       | table1.columns | table2.columns
          2 |         2 | Review      | table1.columns | table2.columns
          2 |         3 | Final       | table1.columns | table2.columns

The column1/2/N is just to represent that each table has a bunch of additional columns of metadata, most of which I'd like to be present in my results.

The following is about as close as I was able to get (newbie), but I couldn't figure out how to also bring in the data columns from table b.

SELECT
    a.*
FROM
    table1 a
    INNER JOIN
        (SELECT contractId, subTaskId, MAX(taskStep) AS taskStep
         FROM table2 GROUP BY contractId, subTaskId) AS b ON
        a.contractId = b.contractId
        AND a.subTaskId = b.subTaskId

Thank you!

--*--END QUESTION--*--

--*--BEGIN ANSWER--*--
Which database are you using? I'll assume Postgres because why not. :)

Others are asking what you're looking for because it isn't clear. My best guess is that you want, for each contractId/subTaskId grouping, the row with the maximum taskStep within that grouping. The other columns come along for the ride.

In that case I'd say you're on the right track. Let's ignore the join for starters, to simplify the problem. I'll only look at table2.

So again, what you've started writing looks pretty good:

mwdb=# SELECT contractId, subTaskId, MAX(taskStep) AS taskStep
mwdb-#          FROM table2 GROUP BY contractId, subTaskId
mwdb-# ORDER BY contractId, subTaskId; --I added an ORDER BY so it looks nicer
 contractid | subtaskid | taskstep
------------+-----------+----------
          1 |         1 |        3
          1 |         2 |        2
          1 |         3 |        4
          1 |         4 |        2
          2 |         1 |        1
          2 |         2 |        1
          2 |         3 |        1
(7 rows)

Eyeballing these results and comparing them to your expected output, so far so good! It's the right number of rows, and taskstep corresponds to the processStep you want. But how do we pull in processStep?

I'll let you in on a little trick. When you use aggregate functions with a group by, you can't just (except in MySQL - but they cheat :)) pull in any arbitrary other column you want. See below:

mwdb=# SELECT contractId, subTaskId, MAX(taskStep) AS taskStep, processstep
mwdb-# FROM table2 GROUP BY contractId, subTaskId;
ERROR:  column "table2.processstep" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...contractId, subTaskId, MAX(taskStep) AS taskStep, processste...

But what if we slap an arbitrary aggregate function on that column to bring it along for the ride? This is a trick I often use if the "along for the ride" column is the same for each combination of grouping columns. A quick demo:

--here's my test table
mwdb=# SELECT * FROM delete_me;
 a | b | c
---+---+---
 1 | 1 | 1
 1 | 2 | 1
 1 | 3 | 1
 2 | 1 | 1
 2 | 2 | 1
(5 rows)

--and here's my query that brings column c along for the ride
mwdb=# SELECT a, MAX(b) AS max_b, MIN(c) AS c FROM delete_me GROUP BY a;
 a | max_b | c
---+-------+---
 1 |     3 | 1
 2 |     2 | 1
(2 rows)

I could have easily used MAX(c) instead of MIN(c) - it doesn't really matter. BUT we cannot use this trick in our original problem. This is because processStep is not unique for a given combination of contractId, subTaskId. If we try this trick we will essentially be grabbing an arbitrary value of processStep, like so:

--WRONG
mwdb=# SELECT contractId, subTaskId, MAX(taskStep) AS taskStep, MIN(processstep) AS processStep
FROM table2 GROUP BY contractId, subTaskId
mwdb-# ORDER BY contractId, subTaskId;
 contractid | subtaskid | taskstep | processstep
------------+-----------+----------+-------------
          1 |         1 |        3 | Processing
          1 |         2 |        2 | Processing
          1 |         3 |        4 | Final
          1 |         4 |        2 | Final
          2 |         1 |        1 | Final
          2 |         2 |        1 | Review
          2 |         3 |        1 | Final
(7 rows)

Comparing this to your expected results, we can see that processStep is not always correct. So now what?

Often I've seen coders solve this problem by writing a query that scans the table twice like so:

mwdb=# SELECT table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep
mwdb-# FROM table2
mwdb-# JOIN (SELECT contractId, subTaskId, MAX(taskstep) max_taskstep
mwdb(#       FROM table2 GROUP BY contractId, subTaskId) max_table2
mwdb-#   ON table2.contractId = max_table2.contractId
mwdb-#   AND table2.subTaskId = max_table2.subTaskId
mwdb-#   AND table2.taskstep = max_table2.max_taskstep
mwdb-# ORDER BY table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep;
 contractid | subtaskid | taskstep | processstep
------------+-----------+----------+-------------
          1 |         1 |        3 | Routing
          1 |         2 |        2 | Routing
          1 |         3 |        4 | Final
          1 |         4 |        2 | Final
          2 |         1 |        1 | Final
          2 |         2 |        1 | Review
          2 |         3 |        1 | Final
(7 rows)

TA-DA!!! This is not the best solution, though. As I said before, it scans the table twice, which you should generally avoid if possible, and it's not the most elegant solution out there. Still, it's perfectly acceptable if it meets your performance criteria - maybe performance doesn't matter at all for your purposes.

If we EXPLAIN ANALYZE this query, we can see proof that the Postgres query planner decided to scan table2 twice. In our tiny table it doesn't really matter, but what if we had a billion rows in the table?

mwdb=# EXPLAIN ANALYZE SELECT table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep FROM table2 JOIN (SELECT contractId, subTaskId, MAX(taskstep) max_taskstep FROM table2 GROUP BY contractId, subTaskId) max_table2 ON table2.contractId=max_table2.contractId AND table2.subTaskId=max_table2.subTaskId AND table2.taskstep = max_table2.max_taskstep
mwdb-# ORDER BY table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep;
                                                                          QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=100.38..100.39 rows=1 width=44) (actual time=0.137..0.138 rows=7 loops=1)
   Sort Key: table2.contractid, table2.subtaskid, table2.taskstep, table2.processstep
   Sort Method: quicksort  Memory: 25kB
   ->  Merge Join  (cost=90.56..100.37 rows=1 width=44) (actual time=0.107..0.120 rows=7 loops=1)
         Merge Cond: ((table2.contractid = table2_1.contractid) AND (table2.subtaskid = table2_1.subtaskid) AND (table2.taskstep = (max(table2_1.taskstep))))
         ->  Sort  (cost=55.27..57.22 rows=780 width=44) (actual time=0.071..0.071 rows=14 loops=1)
               Sort Key: table2.contractid, table2.subtaskid, table2.taskstep
               Sort Method: quicksort  Memory: 26kB
               ->  Seq Scan on table2  (cost=0.00..17.80 rows=780 width=44) (actual time=0.049..0.057 rows=14 loops=1)
         ->  Sort  (cost=35.29..35.79 rows=200 width=12) (actual time=0.032..0.032 rows=7 loops=1)
               Sort Key: table2_1.contractid, table2_1.subtaskid, (max(table2_1.taskstep))
               Sort Method: quicksort  Memory: 25kB
               ->  HashAggregate  (cost=23.65..25.65 rows=200 width=12) (actual time=0.017..0.017 rows=7 loops=1)
                     Group Key: table2_1.contractid, table2_1.subtaskid
                     ->  Seq Scan on table2 table2_1  (cost=0.00..17.80 rows=780 width=12) (actual time=0.002..0.004 rows=14 loops=1)
 Planning time: 0.388 ms
 Execution time: 0.503 ms
(17 rows)

So how do we avoid scanning the table twice? Enter our savior, window functions!!!

Let's throw away the GROUP BY, and instead use the window function row_number() over (partition by contractid, subtaskid order by taskstep desc). That probably looks like random gibberish if you haven't seen it before, but all it means is: generate a row number (1, 2, 3, 4, etc.) that resets to 1 for every combination of contractid and subtaskid, counting down from the largest taskstep within that combo. Why do we want this number? Well, if the numbering starts at the largest taskstep for every contractid/subtaskid combo, then we know that row #1 is the one we want to keep. We can throw out the rest!

Let's see it in action:

mwdb=# SELECT contractid, subtaskid, taskstep, processstep, ROW_NUMBER() OVER(PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn
FROM table2
ORDER BY contractId, subTaskId;

 contractid | subtaskid | taskstep | processstep | rn
------------+-----------+----------+-------------+----
          1 |         1 |        3 | Routing     |  1
          1 |         1 |        2 | Review      |  2
          1 |         1 |        1 | Processing  |  3
          1 |         2 |        2 | Routing     |  1
          1 |         2 |        1 | Processing  |  2
          1 |         3 |        4 | Final       |  1
          1 |         3 |        3 | Routing     |  2
          1 |         3 |        2 | Review      |  3
          1 |         3 |        1 | Processing  |  4
          1 |         4 |        2 | Final       |  1
          1 |         4 |        1 | Processing  |  2
          2 |         1 |        1 | Final       |  1
          2 |         2 |        1 | Review      |  1
          2 |         3 |        1 | Final       |  1
(14 rows)

Now the task from here is simple. Keep only the rows with rn=1!

mwdb=# SELECT contractid, subtaskid, taskstep, processstep FROM (
mwdb(#   SELECT contractid, subtaskid, taskstep, processstep, ROW_NUMBER() OVER(PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn
mwdb(#   FROM table2
mwdb(# ) sub
mwdb-# WHERE rn=1
mwdb-# ORDER BY contractId, subTaskId;
 contractid | subtaskid | taskstep | processstep
------------+-----------+----------+-------------
          1 |         1 |        3 | Routing
          1 |         2 |        2 | Routing
          1 |         3 |        4 | Final
          1 |         4 |        2 | Final
          2 |         1 |        1 | Final
          2 |         2 |        1 | Review
          2 |         3 |        1 | Final
(7 rows)

And that's your final answer! But before we go, let's prove that it's not scanning the table twice:

mwdb=# EXPLAIN ANALYZE SELECT contractid, subtaskid, taskstep, processstep FROM (SELECT contractid, subtaskid, taskstep, processstep, ROW_NUMBER() OVER(PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn FROM table2) sub WHERE rn=1 ORDER BY contractId, subTaskId;

                                                      QUERY PLAN                                                     
-----------------------------------------------------------------------------------------------------------------------
 Subquery Scan on sub  (cost=55.27..82.57 rows=4 width=44) (actual time=0.069..0.096 rows=7 loops=1)
   Filter: (sub.rn = 1)
   Rows Removed by Filter: 7
   ->  WindowAgg  (cost=55.27..72.82 rows=780 width=44) (actual time=0.064..0.086 rows=14 loops=1)
         ->  Sort  (cost=55.27..57.22 rows=780 width=44) (actual time=0.054..0.055 rows=14 loops=1)
               Sort Key: table2.contractid, table2.subtaskid, table2.taskstep DESC
               Sort Method: quicksort  Memory: 26kB
               ->  Seq Scan on table2  (cost=0.00..17.80 rows=780 width=44) (actual time=0.008..0.013 rows=14 loops=1)
 Planning time: 0.197 ms
 Execution time: 0.168 ms
(10 rows)

I'll clean the formatting up on this post tomorrow perhaps. It's late and I have to go to bed. :)

Last-second note: To be fair, it did do a "Subquery Scan on sub" to get just the rn=1 rows, in addition to the "Seq Scan" on table2. The first query did two seq scans, so in a sense our new query still touched every row twice. Is this any better, then? In most real-world cases, the window function query is still a better bet: you would often have a WHERE clause on the "sub" query, and it's best to avoid having to filter the data twice. (Imagine a billion rows in the table and a complicated WHERE clause...)
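For example, a filter on a partition-key column can be applied once, inside the subquery, before the ranking happens. A quick hypothetical sketch (pretending we only care about contract 1; filtering on a partition key before ranking doesn't change the row numbers within the partitions that remain):

--Hypothetical: restrict to contract 1 once, up front, then rank only those rows
SELECT contractid, subtaskid, taskstep, processstep
FROM (
  SELECT contractid, subtaskid, taskstep, processstep,
         ROW_NUMBER() OVER (PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn
  FROM table2
  WHERE contractid = 1
) sub
WHERE rn = 1
ORDER BY contractId, subTaskId;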

--*--END ANSWER--*--

Friday, May 17, 2019

I was very pleased to see the above in the documentation for Psycopg (a PostgreSQL adapter for Python). I'm thinking of doing a presentation on the subject at my company, which shall remain nameless, because time and time again I see a lack of bind variables. I see it on my team; I see it on other teams. We have countless Jira tickets about various applications and other projects breaking due to complications from apostrophes and single quotes in input. (I'd post screenshots - it's really quite impressive - but I can't divulge that information.) Someone please make it stop! I guess that someone has to be me.

Thursday, March 14, 2019

Happy Pi Day!

(Borrowed from Connor McDonald's Twitter account.)



SQL> select
  2    sum((
  3      4 / (8*(level-1)+1) -
  4      2 / (8*(level-1)+4) -
  5      1 / (8*(level-1)+5) -
  6      1 / (8*(level-1)+6)
  7    ) / power(16,level-1)) pi
  8  from dual connect by level <= 10;

        PI
----------
3.14159265
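For the curious, the query is summing the first ten terms of the Bailey-Borwein-Plouffe (BBP) formula, with level-1 playing the role of k:

\[ \pi = \sum_{k=0}^{\infty} \frac{1}{16^k} \left( \frac{4}{8k+1} - \frac{2}{8k+4} - \frac{1}{8k+5} - \frac{1}{8k+6} \right) \]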

Sunday, February 17, 2019

"CTEs to no longer be an optimization fence (COMMITED) (git.postgresql.org)"

Saw this linked to on Reddit with the title I quoted in this post's title...

Great news in the world of Postgres: no longer will Common Table Expressions (CTE) be shackled to the evil CTE Optimization Fence! Linky

As I previously blogged about, the optimization fence is a drawback to CTEs in Postgres that can hurt performance: you might use a CTE to make a query neater, but pay a penalty for it. If the new commits are all they promise, this performance hit will be no more! Here's the link to my old post about them - linky - note that CTEs in Postgres should now perform about as well as they do in Oracle!

Sunday, January 06, 2019