Saturday, June 11, 2016

Neat graphical summarization of SQL joins


I'm not sure who to credit for it, I just saw it linked to on reddit with no source.... Note the first one (with the comma) is also called a CROSS JOIN.

Edit: Apologies to the color blind...

Generating a series of integers as a set of rows


In my last post, Cumulative row count over date ranges, I take advantage of a postgres function called generate_series(start, stop) - this generates a series of values, from start to stop. So, for example, generate_series(2,4) returns:

SELECT * FROM generate_series(1,4);
 generate_series
-----------------
               1 
               2
               3
               4

A question that’s been on my mind is how to achieve this in other database systems?

Oracle


Oracle has a way to generate a series as well, but it’s just not quite as simple as calling a function:

SELECT level i
FROM dual
CONNECT BY level <= 4

Less intuitive but it works nicely. What it’s doing is it uses the CONNECT BY clause on the dual table - this means the 1 column, 1 row dual table is being “connected” to itself recursively — this is called a hierarchical query (https://docs.oracle.com/database/121/SQLRF/queries003.htm). “Level” is a special pseudocolumn that tells you the level of recursion depth. So, while this method is more complicated, it does work nicely!

MySQL


The MySQL way….

I didn’t know how to achieve generate_series in MySQL off the top of my head, so off to google! Unfortunately the method I found depends on selecting from a table of at least n size, where n is the number of values you wish to generate. Join on a subquery that declares a variable @num, and in your SELECT clause, increment this value per row. I found this solution here, they are using dates but it can be simplified to generate integers. http://stackoverflow.com/questions/6870499/generate-series-equivalent-in-mysql

SELECT @num:=@num+1
FROM INFORMATION_SCHEMA.COLUMNS, (select @num:=0) num
LIMIT 4


+--------------+
| @num:=@num+1 |
+--------------+
|            1 |
|            2 |
|            3 |
|            4 |
+--------------+

4 rows in set, 1 warning (0.02 sec)

Works great but with the caveat that the "dummy" table you select from, in this example INFORMATION_SCHEMA.COLUMNS, must have at least as many rows as you want values. A little scary, in my opinion, to use in a generalized way with this limitation.

MS SQL Server


Oracle-like recursion can be done in SQL Server as well, and wouldn't you know it, some nice developer wrote a generate_series function for SQL Server and shared it on github

Why is generating a series useful, anyway?

There are probably more use cases for generating a series than I can come up with! But as in the previous post I linked to at the top of this one, a series can be useful for implementing something like a for loop. Join against the series as if it were another table, and use its value (call it i) as a parameter to a function, then voila, suddenly you're calling function f for each value of i: f(1), f(2), ..., f(n). This is a great way to create more elegant queries by avoiding copied/pasted repetition. (Again as see in my last post.) 

That's all for now!

Saturday, June 04, 2016

Cumulative row count over date ranges

I recently helped a coworker (Mr. M) with his SQL query that gets a cumulative count over a set of ranges of dates. He wrote a query that functionally works just fine, but he didn't like it because there was a lot of repetition happening. It looked something like:

SELECT [date1], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date1] UNION ALL
SELECT [date2], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date2] UNION ALL
SELECT [date3], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date3] UNION ALL
SELECT [date4], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date4] UNION ALL
SELECT [date5], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date5]

Imagine in place of [datex] there is some mathy date expression -- I'm leaving it out for now (because I'm lazy) but the idea was subtract a week from today for date1, subtract 14 days from today for date2, etc.

So Mr. M did not like that his copy/pasted query parts violated DRY - Don't Repeat Yourself. I think that's less bad than all this UNIONing and repeated checking against CREATED_ON. It's running 5 different queries that are kind of redundant as they are looking up by the same column repeatedly. How can we improve this query? My first instinct was to use windowing functions, but alas, this is MySQL which lacks them (remember, MySQL is a hair band). I'm sure we can do better, but here's what I came up with as a first hack:

SELECT SUM(bucket1) AS week1,
 SUM(bucket2) AS week2,
 SUM(bucket3) AS week3,
 SUM(bucket4) AS week4,
 SUM(bucket5) as week5
FROM (
 SELECT CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(0*7) DAY) THEN 1 ELSE 0 END AS bucket1,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(1*7) DAY) THEN 1 ELSE 0 END AS bucket2,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(2*7) DAY) THEN 1 ELSE 0 END AS bucket3,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(3*7) DAY) THEN 1 ELSE 0 END AS bucket4,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(4*7) DAY) THEN 1 ELSE 0 END AS bucket5
FROM 
MYTABLE) sub

Well, I wasn't able to remove copy/pasted repetition, but I do think removing the UNION ALLs is a positive. What does this query do?

Step 1) The subquery. This is the bulk of the query, which passes over the whole table just once, and puts each row in an bucket from 1 through 5. It will perform 1 scan per row and 5 comparisons per row, vs. scanning each row 5 scans per row and 5 comparison per row.

Step 2) Sum them up!

Now in this specific instance the performance gain was pretty negligible, and we didn't really neaten the query all that much. I wouldn't blame Mr. M if he decides to stick with old query. As far as performance goes, the issue is this table was very small with only about 50k rows, therefore the cost of repeating 5 queries was negligible compared to all the added logic of those CASE WHEN expressions that I added.

Is there a way to avoid the repetition? 
The answer is, for MySQL, I don't know! But I wonder if I can figure something out with Postgres...

So first I made a sample table MYTABLE with 100,001 rows of randomly generated data.

To write the query, I decided to define the 5 dates using a common table expression, this way we can reference it as an imaginary table, so to speak:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks;

          date_point           
-------------------------------
 2016-06-04 13:14:00.560272-04
 2016-05-28 13:14:00.560272-04
 2016-05-21 13:14:00.560272-04
 2016-05-14 13:14:00.560272-04
 2016-05-07 13:14:00.560272-04
(5 rows)

Next, let's join that to our table so that for each date_point, get the rows where created_on is less than or equal to the date_point...

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks JOIN mytable ON mytable.created_on <= weeks.date_point;

          date_point           | created_on 
-------------------------------+------------
 2016-06-04 13:15:40.869402-04 | 2015-12-12
 2016-06-04 13:15:40.869402-04 | 2016-02-19
 2016-06-04 13:15:40.869402-04 | 2016-04-20
 2016-06-04 13:15:40.869402-04 | 2016-02-01
[snipping tons of rows]

Final step: get the counts per date_point:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT COUNT(*), date_point
FROM weeks JOIN
     mytable ON mytable.created_on <= weeks.date_point
GROUP BY date_point
ORDER BY date_point DESC;

 count  |          date_point          
--------+------------------------------
 100001 | 2016-06-04 13:17:29.33812-04
  98232 | 2016-05-28 13:17:29.33812-04
  96290 | 2016-05-21 13:17:29.33812-04
  94313 | 2016-05-14 13:17:29.33812-04
  92350 | 2016-05-07 13:17:29.33812-04
(5 rows)

TA-DA!!! WE DID IT! We got the cumulative counts without copying/pasting several queries or several conditions - we just had to define the weeks we concerned about (weeks 0 through 4) in the call to generate_series(), put that in a CTE, and query that with a join... [Edit: Note, we should even be able to parameterize the number of weeks by doing something like generate_series(0,?) -- this way we could easily use any number of weeks.] My initial instinct of requiring window functions was wrong. But a big THANK YOU TO COMMON TABLE EXPRESSIONS. :) 

Question: Does anyone know how to do this in MySQL??

Day-After Edit: I guess a similar technique could be used in MySQL, only we must replace the CTE with a subquery, and replace generate_series with some way of selecting 0-4.... I guess this would work: (SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4) - but perhaps this would count as the type of repetition we want to avoid...Another edit: apparently there is a solution, albeit an ugly one: http://stackoverflow.com/questions/6870499/generate-series-equivalent-in-mysql