Thursday, June 06, 2019

Helping a redditor with a SQL problem but then he vanished

A question asked on /r/sql follows. I typed up a big long answer then he deleted his post by the time I responded! So I'm going to post this here just so I didn't waste all my time typing it!

--*--BEGIN QUESTION--*--
In lieu of our internal guru being available I was wondering if anyone would be able to help me figure out how to query the data I need.

table1 - contract data

contractId subTaskId column1 column2 column_N
1 1 meta1 meta2 metaN
1 2 meta1 meta2 metaN
1 3 meta1 meta2 metaN
1 4 meta1 meta2 metaN
2 1 meta1 meta2 metaN
2 2 meta1 meta2 metaN
2 3 meta1 meta2 metaN
table2 - workflow tracking for each contract subTask

contractId subTaskId taskStep processStep column1 column_N
1 1 1 Processing meta1 metaN
1 1 2 Review meta1 metaN
1 1 3 Routing meta1 metaN
1 2 1 Processing meta1 metaN
1 2 2 Routing meta1 metaN
1 3 1 Processing meta1 metaN
1 3 2 Review meta1 metaN
1 3 3 Routing meta1 metaN
1 3 4 Final meta1 metaN
1 4 1 Processing meta1 metaN
1 4 2 Final meta1 metaN
2 1 1 Final meta1 metaN
2 2 1 Review meta1 metaN
2 3 1 Final meta1 metaN

Results

contractId subTaskId processStep table1.columns table2.columns
1 1 Routing table1.columns table2.columns
1 2 Routing table1.columns table2.columns
1 3 Final table1.columns table2.columns
1 4 Final table1.columns table2.columns
2 1 Final table1.columns table2.columns
2 2 Review table1.columns table2.columns
2 3 Final table1.columns table2.columns

The column1/2/N is just to represent that each table has a bunch of additional columns of metadata, most of which I'd like to be present in my results.

The following is about as close as I was able to get(newbie), but I couldn't figure out how to also bring in the data columns from table b.

SELECT
    a.*
FROM
    table1 a
    INNER JOIN
        (SELECT contractId, subTaskId, MAX(taskStep) AS taskStep
         FROM table2 GROUP BY contractId, subTaskId) AS b ON
        a.contractId = b.contractId
        AND a.subTaskId = b.subTaskId

Thank you!

--*--END QUESTION--*--

--*--BEGIN ANSWER--*--
Which database are you using? I'll assume Postgres because why not. :)

Others are asking what you're looking for because it isn't clear. My best guess is you want, per each contractId/subTaskId grouping, the row with the maximum taskStep within that grouping. The other columns come along for the ride.

In that case I'd say you're on the right track. Let's ignore the join for starters, to simplify the problem. I'll only look at table2.

So again, what you've started writing looks pretty good:

mwdb=# SELECT contractId, subTaskId, MAX(taskStep) AS taskStep
mwdb-#          FROM table2 GROUP BY contractId, subTaskId
mwdb-# ORDER BY contractId, subTaskId; --I added an ORDER BY so it looks nicer
 contractid | subtaskid | taskstep
------------+-----------+----------
          1 |         1 |        3
          1 |         2 |        2
          1 |         3 |        4
          1 |         4 |        2
          2 |         1 |        1
          2 |         2 |        1
          2 |         3 |        1
(7 rows)

Eyeballing these results and comparing to your expected output, so far looks good! It's the right number of rows and taskstep corresponds to the processStep you want. But how to pull in processStep?

I'll let you in on a little trick. When you use aggregate functions with a group by, you can't just (except in MySQL - but they cheat :)) pull in any arbitrary other column you want. See below:

mwdb=# SELECT contractId, subTaskId, MAX(taskStep) AS taskStep, processstep
mwdb-# FROM table2 GROUP BY contractId, subTaskId;
ERROR:  column "table2.processstep" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: ...contractId, subTaskId, MAX(taskStep) AS taskStep, processste...

But what if we slap an arbitrary aggregate function to bring it along for the ride? This is a trick I often use if the "along for the ride" column is the same for each combination of grouping columns. A quick demo:

--here's my test table
mwdb=# SELECT * FROM delete_me;
 a | b | c
---+---+---
 1 | 1 | 1
 1 | 2 | 1
 1 | 3 | 1
 2 | 1 | 1
 2 | 2 | 1
(5 rows)

--and here's my query that brings column c along for the ride
mwdb=# SELECT a, MAX(b) AS max_b, MIN(c) AS c FROM delete_me GROUP BY a;
 a | max_b | c
---+-------+---
 1 |     3 | 1
 2 |     2 | 1
(2 rows)

I could have easily used MAX(c) instead of MIN(c) - it doesn't really matter. BUT we cannot use this trick in our original problem. This is because processStep is not unique for a given combination of contractId, subTaskId. If we try this trick we will essentially be grabbing an arbitrary value of processStep, like so:

--WRONG
mwdb=# SELECT contractId, subTaskId, MAX(taskStep) AS taskStep, MIN(processstep) AS processStep
FROM table2 GROUP BY contractId, subTaskId
mwdb-# ORDER BY contractId, subTaskId;
 contractid | subtaskid | taskstep | processstep
------------+-----------+----------+-------------
          1 |         1 |        3 | Processing
          1 |         2 |        2 | Processing
          1 |         3 |        4 | Final
          1 |         4 |        2 | Final
          2 |         1 |        1 | Final
          2 |         2 |        1 | Review
          2 |         3 |        1 | Final
(7 rows)

Comparing this to your expected results, we can see that processStep is not always correct. So now what?

Often I've seen coders solve this problem by writing a query that scans the table twice like so:

mwdb=# SELECT table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep                                                                                          FROM table2 JOIN (SELECT contractId, subTaskId, MAX(taskstep) max_taskstep FROM table2 GROUP BY contractId, subTaskId) max_table2 ON table2.contractId=max_table2.contractId AND table2.subTaskId=max_table2.subTaskId AND table2.taskstep = max_table2.max_taskstep
ORDER BY table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep
mwdb-# ;
 contractid | subtaskid | taskstep | processstep
------------+-----------+----------+-------------
          1 |         1 |        3 | Routing
          1 |         2 |        2 | Routing
          1 |         3 |        4 | Final
          1 |         4 |        2 | Final
          2 |         1 |        1 | Final
          2 |         2 |        1 | Review
          2 |         3 |        1 | Final
(7 rows)

TA-DA!!! However, this is not the best solution. Like I said before, we have to scan the table twice, which you should avoid in general if possible. Besides it's not the most elegant solution out there. However it's perfectly acceptable if it meets your performance criteria. Maybe you don't care about performance at all for your purposes.

If we EXPLAIN ANALYZE this query, we can see proof that the Postgres query planner decided to scan table2 twice. In our tiny table it doesn't really matter, but what if we had a billion rows in the table?

mwdb=# EXPLAIN ANALYZE SELECT table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep FROM table2 JOIN (SELECT contractId, subTaskId, MAX(taskstep) max_taskstep FROM table2 GROUP BY contractId, subTaskId) max_table2 ON table2.contractId=max_table2.contractId AND table2.subTaskId=max_table2.subTaskId AND table2.taskstep = max_table2.max_taskstep
mwdb-# ORDER BY table2.contractId, table2.subTaskId, table2.taskstep, table2.processstep;
                                                                          QUERY PLAN                                                                       
--------------------------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=100.38..100.39 rows=1 width=44) (actual time=0.137..0.138 rows=7 loops=1)
   Sort Key: table2.contractid, table2.subtaskid, table2.taskstep, table2.processstep
   Sort Method: quicksort  Memory: 25kB
   ->  Merge Join  (cost=90.56..100.37 rows=1 width=44) (actual time=0.107..0.120 rows=7 loops=1)
         Merge Cond: ((table2.contractid = table2_1.contractid) AND (table2.subtaskid = table2_1.subtaskid) AND (table2.taskstep = (max(table2_1.taskstep))))
         ->  Sort  (cost=55.27..57.22 rows=780 width=44) (actual time=0.071..0.071 rows=14 loops=1)
               Sort Key: table2.contractid, table2.subtaskid, table2.taskstep
               Sort Method: quicksort  Memory: 26kB
               ->  Seq Scan on table2  (cost=0.00..17.80 rows=780 width=44) (actual time=0.049..0.057 rows=14 loops=1)
         ->  Sort  (cost=35.29..35.79 rows=200 width=12) (actual time=0.032..0.032 rows=7 loops=1)
               Sort Key: table2_1.contractid, table2_1.subtaskid, (max(table2_1.taskstep))
               Sort Method: quicksort  Memory: 25kB
               ->  HashAggregate  (cost=23.65..25.65 rows=200 width=12) (actual time=0.017..0.017 rows=7 loops=1)
                     Group Key: table2_1.contractid, table2_1.subtaskid
                     ->  Seq Scan on table2 table2_1  (cost=0.00..17.80 rows=780 width=12) (actual time=0.002..0.004 rows=14 loops=1)
 Planning time: 0.388 ms
 Execution time: 0.503 ms
(17 rows)

So how do we avoid scanning the table twice? Enter our savior, window functions!!!

Let's throw away the group by, and instead use the row_number() over (partition by contractid, subtaskid order by taskstep desc) window function. That probably looks like random gibberish if you haven't seen it before, but it all it means is: let's generate a row number (1, 2, 3, 4, etc.) that resets to one for every combination of contractid, subtaskid, and start counting from the largest taskstep within that combo. Why do we want this number? Well if we start with the largest taskstep for every contractid, subtaskid combo, then we know that #1 is the one we want to keep. We can throw out the rest!

Let's see it in action:

mwdb=# SELECT contractid, subtaskid, taskstep, processstep, ROW_NUMBER() OVER(PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn
FROM table2
ORDER BY contractId, subTaskId;

 contractid | subtaskid | taskstep | processstep | rn
------------+-----------+----------+-------------+----
          1 |         1 |        3 | Routing     |  1
          1 |         1 |        2 | Review      |  2
          1 |         1 |        1 | Processing  |  3
          1 |         2 |        2 | Routing     |  1
          1 |         2 |        1 | Processing  |  2
          1 |         3 |        4 | Final       |  1
          1 |         3 |        3 | Routing     |  2
          1 |         3 |        2 | Review      |  3
          1 |         3 |        1 | Processing  |  4
          1 |         4 |        2 | Final       |  1
          1 |         4 |        1 | Processing  |  2
          2 |         1 |        1 | Final       |  1
          2 |         2 |        1 | Review      |  1
          2 |         3 |        1 | Final       |  1
(14 rows)

Now the task from here is simple. Keep only the rows with rn=1!

mwdb=# SELECT contractid, subtaskid, taskstep, processstep FROM (
mwdb(#   SELECT contractid, subtaskid, taskstep, processstep, ROW_NUMBER() OVER(PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn
mwdb(#   FROM table2
mwdb(# ) sub
mwdb-# WHERE rn=1
mwdb-# ORDER BY contractId, subTaskId;
 contractid | subtaskid | taskstep | processstep
------------+-----------+----------+-------------
          1 |         1 |        3 | Routing
          1 |         2 |        2 | Routing
          1 |         3 |        4 | Final
          1 |         4 |        2 | Final
          2 |         1 |        1 | Final
          2 |         2 |        1 | Review
          2 |         3 |        1 | Final
(7 rows)

And that's your final answer! But before we go, let's prove that it's not scanning the table twice:

mwdb=# EXPLAIN ANALYZE SELECT contractid, subtaskid, taskstep, processstep FROM (SELECT contractid, subtaskid, taskstep, processstep, ROW_NUMBER() OVER(PARTITION BY contractid, subtaskid ORDER BY taskstep DESC) AS rn FROM table2) sub WHERE rn=1 ORDER BY contractId, subTaskId;

                                                      QUERY PLAN                                                     
-----------------------------------------------------------------------------------------------------------------------
 Subquery Scan on sub  (cost=55.27..82.57 rows=4 width=44) (actual time=0.069..0.096 rows=7 loops=1)
   Filter: (sub.rn = 1)
   Rows Removed by Filter: 7
   ->  WindowAgg  (cost=55.27..72.82 rows=780 width=44) (actual time=0.064..0.086 rows=14 loops=1)
         ->  Sort  (cost=55.27..57.22 rows=780 width=44) (actual time=0.054..0.055 rows=14 loops=1)
               Sort Key: table2.contractid, table2.subtaskid, table2.taskstep DESC
               Sort Method: quicksort  Memory: 26kB
               ->  Seq Scan on table2  (cost=0.00..17.80 rows=780 width=44) (actual time=0.008..0.013 rows=14 loops=1)
 Planning time: 0.197 ms
 Execution time: 0.168 ms
(10 rows)

I'll clean the formatting up on this post tomorrow perhaps. It's late and I have to go to bed. :)

Last-second note: To be fair, it did do a "subquery scan on sub" to get just the rn=1 rows, in addition to the "seq scan" on table2. The first query did two "seq scans", so in a sense, our new query did scan every row twice. So is this any better? Well in most real world cases, the window function query is still a better bet. You would often have a where clause on the "sub" query, and best to avoid having to filter the data twice. (Imagine a billion rows in the table and a complicated where clause...)

--*--END ANSWER--*--

No comments: