Saturday, June 04, 2016

Cumulative row count over date ranges

I recently helped a coworker (Mr. M) with his SQL query that gets a cumulative count over a set of ranges of dates. He wrote a query that functionally works just fine, but he didn't like it because there was a lot of repetition happening. It looked something like:

SELECT [date1], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date1] UNION ALL
SELECT [date2], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date2] UNION ALL
SELECT [date3], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date3] UNION ALL
SELECT [date4], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date4] UNION ALL
SELECT [date5], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date5]

Imagine in place of [datex] there is some mathy date expression -- I'm leaving it out for now (because I'm lazy) but the idea was subtract a week from today for date1, subtract 14 days from today for date2, etc.

So Mr. M did not like that his copy/pasted query parts violated DRY - Don't Repeat Yourself. I think that's less bad than all this UNIONing and repeated checking against CREATED_ON. It's running 5 different queries that are kind of redundant as they are looking up by the same column repeatedly. How can we improve this query? My first instinct was to use windowing functions, but alas, this is MySQL which lacks them (remember, MySQL is a hair band). I'm sure we can do better, but here's what I came up with as a first hack:

SELECT SUM(bucket1) AS week1,
 SUM(bucket2) AS week2,
 SUM(bucket3) AS week3,
 SUM(bucket4) AS week4,
 SUM(bucket5) as week5
FROM (
 SELECT CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(0*7) DAY) THEN 1 ELSE 0 END AS bucket1,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(1*7) DAY) THEN 1 ELSE 0 END AS bucket2,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(2*7) DAY) THEN 1 ELSE 0 END AS bucket3,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(3*7) DAY) THEN 1 ELSE 0 END AS bucket4,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(4*7) DAY) THEN 1 ELSE 0 END AS bucket5
FROM 
MYTABLE) sub

Well, I wasn't able to remove copy/pasted repetition, but I do think removing the UNION ALLs is a positive. What does this query do?

Step 1) The subquery. This is the bulk of the query, which passes over the whole table just once, and puts each row in an bucket from 1 through 5. It will perform 1 scan per row and 5 comparisons per row, vs. scanning each row 5 scans per row and 5 comparison per row.

Step 2) Sum them up!

Now in this specific instance the performance gain was pretty negligible, and we didn't really neaten the query all that much. I wouldn't blame Mr. M if he decides to stick with old query. As far as performance goes, the issue is this table was very small with only about 50k rows, therefore the cost of repeating 5 queries was negligible compared to all the added logic of those CASE WHEN expressions that I added.

Is there a way to avoid the repetition? 
The answer is, for MySQL, I don't know! But I wonder if I can figure something out with Postgres...

So first I made a sample table MYTABLE with 100,001 rows of randomly generated data.

To write the query, I decided to define the 5 dates using a common table expression, this way we can reference it as an imaginary table, so to speak:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks;

          date_point           
-------------------------------
 2016-06-04 13:14:00.560272-04
 2016-05-28 13:14:00.560272-04
 2016-05-21 13:14:00.560272-04
 2016-05-14 13:14:00.560272-04
 2016-05-07 13:14:00.560272-04
(5 rows)

Next, let's join that to our table so that for each date_point, get the rows where created_on is less than or equal to the date_point...

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks JOIN mytable ON mytable.created_on <= weeks.date_point;

          date_point           | created_on 
-------------------------------+------------
 2016-06-04 13:15:40.869402-04 | 2015-12-12
 2016-06-04 13:15:40.869402-04 | 2016-02-19
 2016-06-04 13:15:40.869402-04 | 2016-04-20
 2016-06-04 13:15:40.869402-04 | 2016-02-01
[snipping tons of rows]

Final step: get the counts per date_point:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT COUNT(*), date_point
FROM weeks JOIN
     mytable ON mytable.created_on <= weeks.date_point
GROUP BY date_point
ORDER BY date_point DESC;

 count  |          date_point          
--------+------------------------------
 100001 | 2016-06-04 13:17:29.33812-04
  98232 | 2016-05-28 13:17:29.33812-04
  96290 | 2016-05-21 13:17:29.33812-04
  94313 | 2016-05-14 13:17:29.33812-04
  92350 | 2016-05-07 13:17:29.33812-04
(5 rows)

TA-DA!!! WE DID IT! We got the cumulative counts without copying/pasting several queries or several conditions - we just had to define the weeks we concerned about (weeks 0 through 4) in the call to generate_series(), put that in a CTE, and query that with a join... [Edit: Note, we should even be able to parameterize the number of weeks by doing something like generate_series(0,?) -- this way we could easily use any number of weeks.] My initial instinct of requiring window functions was wrong. But a big THANK YOU TO COMMON TABLE EXPRESSIONS. :) 

Question: Does anyone know how to do this in MySQL??

Day-After Edit: I guess a similar technique could be used in MySQL, only we must replace the CTE with a subquery, and replace generate_series with some way of selecting 0-4.... I guess this would work: (SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4) - but perhaps this would count as the type of repetition we want to avoid...Another edit: apparently there is a solution, albeit an ugly one: http://stackoverflow.com/questions/6870499/generate-series-equivalent-in-mysql

No comments: