Mark W the SQL Geek: June 2016

Saturday, June 11, 2016

Neat graphical summarization of SQL joins

I'm not sure who to credit for it, I just saw it linked to on reddit with no source.... Note the first one (with the comma) is also called a CROSS JOIN.

Edit: Apologies to the color blind...

Generating a series of integers as a set of rows

In my last post, Cumulative row count over date ranges, I take advantage of a postgres function called generate_series(start, stop) - this generates a series of values, from start to stop. So, for example, generate_series(1,4) returns:

SELECT * FROM generate_series(1,4);

generate_series

-----------------

A question that’s been on my mind is how to achieve this in other database systems?

Oracle

Oracle has a way to generate a series as well, but it’s just not quite as simple as calling a function:

SELECT level i

FROM dual

CONNECT BY level <= 4

Less intuitive but it works nicely. What it’s doing is it uses the CONNECT BY clause on the dual table - this means the 1 column, 1 row dual table is being “connected” to itself recursively — this is called a hierarchical query (https://docs.oracle.com/database/121/SQLRF/queries003.htm). “Level” is a special pseudocolumn that tells you the level of recursion depth. So, while this method is more complicated, it does work nicely!

MySQL

The MySQL way….

I didn’t know how to achieve generate_series in MySQL off the top of my head, so off to google! Unfortunately the method I found depends on selecting from a table of at least n size, where n is the number of values you wish to generate. Join on a subquery that declares a variable @num, and in your SELECT clause, increment this value per row. I found this solution here, they are using dates but it can be simplified to generate integers. http://stackoverflow.com/questions/6870499/generate-series-equivalent-in-mysql

SELECT @num:=@num+1

FROM INFORMATION_SCHEMA.COLUMNS, (select @num:=0) num

LIMIT 4

+--------------+

| @num:=@num+1 |

+--------------+

| 1 |

| 2 |

| 3 |

| 4 |

+--------------+

4 rows in set, 1 warning (0.02 sec)

Works great but with the caveat that the "dummy" table you select from, in this example INFORMATION_SCHEMA.COLUMNS, must have at least as many rows as you want values. A little scary, in my opinion, to use in a generalized way with this limitation. If you aren't sure if you have enough, you could cross join the table to itself. So for example, if INFORMATION_SCHEMA.COLUMNS only has 1000 rows, you could cross join it to itself to get a max of 1,000,000 rows like so:

... FROM INFORMATION_SCHEMA.COLUMNS AS a, INFORMATION_SCHEMA.COLUMNS AS b ...

MS SQL Server

Oracle-like recursion can be done in SQL Server as well, and wouldn't you know it, some nice developer wrote a generate_series function for SQL Server and shared it on github! UPDATE: SQL Server 2022+ provides a built-in generate_series() function!

Why is generating a series useful, anyway?

There are probably more use cases for generating a series than I can come up with! But as in the previous post I linked to at the top of this one, a series can be useful for implementing something like a for loop. Join against the series as if it were another table, and use its value (call it i) as a parameter to a function, then voila, suddenly you're calling function f for each value of i: f(1), f(2), ..., f(n). This is a great way to create more elegant queries by avoiding copied/pasted repetition. (Again as see in my last post.)

It's also great for generating test data, perhaps large amounts of it. Maybe you want to test a performance hypothesis on a schema that you're currently developing. You don't have any real production data yet. You can pull rows of data out of thin air using generate_series() or similar methods, then run your queries and see how they perform.

That's all for now!

EDIT: These days SQL Server has its own generate_series() function!

MySQL now also supports recursive CTEs, so you could try to generate a series of rows that way. Just be careful of the recursion limit.

Also worth nothing: if you're using Spark/Databricks SQL, there is the range() function which is similar to generate_series():

Saturday, June 04, 2016

Cumulative row count over date ranges

I recently helped a coworker (Mr. M) with his SQL query that gets a cumulative count over a set of ranges of dates. He wrote a query that functionally works just fine, but he didn't like it because there was a lot of repetition happening. It looked something like:

SELECT [date1], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date1] UNION ALL
SELECT [date2], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date2] UNION ALL
SELECT [date3], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date3] UNION ALL
SELECT [date4], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date4] UNION ALL
SELECT [date5], COUNT(*) FROM MYTABLE WHERE CREATED_ON <= [date5]

Imagine in place of [datex] there is some mathy date expression -- I'm leaving it out for now (because I'm lazy) but the idea was subtract a week from today for date1, subtract 14 days from today for date2, etc.

So Mr. M did not like that his copy/pasted query parts violated DRY - Don't Repeat Yourself. I think that's less bad than all this UNIONing and repeated checking against CREATED_ON. It's running 5 different queries that are kind of redundant as they are looking up by the same column repeatedly. How can we improve this query? My first instinct was to use windowing functions, but alas, this is MySQL which lacks them (remember, MySQL is a hair band). I'm sure we can do better, but here's what I came up with as a first hack:

SELECT SUM(bucket1) AS week1,
SUM(bucket2) AS week2,
SUM(bucket3) AS week3,
SUM(bucket4) AS week4,
SUM(bucket5) as week5
FROM (
SELECT CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(0*7) DAY) THEN 1 ELSE 0 END AS bucket1,
CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(1*7) DAY) THEN 1 ELSE 0 END AS bucket2,
CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(2*7) DAY) THEN 1 ELSE 0 END AS bucket3,
CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(3*7) DAY) THEN 1 ELSE 0 END AS bucket4,
CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(4*7) DAY) THEN 1 ELSE 0 END AS bucket5
FROM
MYTABLE) sub

Well, I wasn't able to remove copy/pasted repetition, but I do think removing the UNION ALLs is a positive. What does this query do?

Step 1) The subquery. This is the bulk of the query, which passes over the whole table just once, and puts each row in an bucket from 1 through 5. It will perform 1 scan per row and 5 comparisons per row, vs. scanning each row 5 scans per row and 5 comparison per row.

Step 2) Sum them up!

Now in this specific instance the performance gain was pretty negligible, and we didn't really neaten the query all that much. I wouldn't blame Mr. M if he decides to stick with old query. As far as performance goes, the issue is this table was very small with only about 50k rows, therefore the cost of repeating 5 queries was negligible compared to all the added logic of those CASE WHEN expressions that I added.

Is there a way to avoid the repetition?
The answer is, for MySQL, I don't know! But I wonder if I can figure something out with Postgres...

So first I made a sample table MYTABLE with 100,001 rows of randomly generated data.

To write the query, I decided to define the 5 dates using a common table expression, this way we can reference it as an imaginary table, so to speak:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks;

date_point

-------------------------------

2016-06-04 13:14:00.560272-04

2016-05-28 13:14:00.560272-04

2016-05-21 13:14:00.560272-04

2016-05-14 13:14:00.560272-04

2016-05-07 13:14:00.560272-04

(5 rows)

Next, let's join that to our table so that for each date_point, get the rows where created_on is less than or equal to the date_point...

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)

SELECT * FROM weeks JOIN mytable ON mytable.created_on <= weeks.date_point;

date_point | created_on

-------------------------------+------------

2016-06-04 13:15:40.869402-04 | 2015-12-12

2016-06-04 13:15:40.869402-04 | 2016-02-19

2016-06-04 13:15:40.869402-04 | 2016-04-20

2016-06-04 13:15:40.869402-04 | 2016-02-01

[snipping tons of rows]

Final step: get the counts per date_point:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)

SELECT COUNT(*), date_point

FROM weeks JOIN

mytable ON mytable.created_on <= weeks.date_point

GROUP BY date_point

ORDER BY date_point DESC;

count | date_point

--------+------------------------------

100001 | 2016-06-04 13:17:29.33812-04

98232 | 2016-05-28 13:17:29.33812-04

96290 | 2016-05-21 13:17:29.33812-04

94313 | 2016-05-14 13:17:29.33812-04

92350 | 2016-05-07 13:17:29.33812-04

(5 rows)

TA-DA!!! WE DID IT! We got the cumulative counts without copying/pasting several queries or several conditions - we just had to define the weeks we concerned about (weeks 0 through 4) in the call to generate_series(), put that in a CTE, and query that with a join... [Edit: Note, we should even be able to parameterize the number of weeks by doing something like generate_series(0,?) -- this way we could easily use any number of weeks.] My initial instinct of requiring window functions was wrong. But a big THANK YOU TO COMMON TABLE EXPRESSIONS. :)

Question: Does anyone know how to do this in MySQL??

Day-After Edit: I guess a similar technique could be used in MySQL, only we must replace the CTE with a subquery, and replace generate_series with some way of selecting 0-4.... I guess this would work: (SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4) - but perhaps this would count as the type of repetition we want to avoid...Another edit: apparently there is a solution, albeit an ugly one: http://stackoverflow.com/questions/6870499/generate-series-equivalent-in-mysql