Thursday, November 10, 2016

MySQL - numeric types don't let you specify "length"

In MySQL I often see developers asking themselves, "Should I make this column an int(10) or int(11)? Should this bigint be bigint(15) or bigint(20)?" The answer is it probably doesn't matter! That number in parentheses does NOT indicate the maximum length of the number. For example, smallint(3) does not max out at 999, but rather it has a range of -32768 to -32767, just like any other smallint (which is 16 bits). The ONLY thing this parenthetical number has anything to do with is an esoteric, display-related feature called ZEROFILL.

ZEROFILL, if enabled, pads the number with zeroes for display purposes -- for example a int(10) with a value of 12345 will come out in your resultset as 0000012345. That's it. That's all it does. Read up more on it here:  

So, in my opinion you should NEVER even bother to define a column as "int(10)" or "bigint(20)" etc., etc., but rather just define an "int" or a "bigint". The only way to express a max length is in terms of which type to use in the first place: tinyint, smallint, int, bigint. (I think that's all of them - int types anyway. :)) So save yourself the clutter of useless additional information, and save yourself the waste of brainpower thinking about max lengths that aren't even real.

Thursday, July 07, 2016

Say NO to Venn Diagrams When Explaining JOINs

Nice post on the jOOQ blog: Say NO to Venn Diagrams When Explaining JOINs ...It's about how visualizing joins as Venn diagrams isn't the best fit -- it's much better for set operators.

Saturday, June 11, 2016

Neat graphical summarization of SQL joins

I'm not sure who to credit for it, I just saw it linked to on reddit with no source.... Note the first one (with the comma) is also called a CROSS JOIN.

Edit: Apologies to the color blind...

Generating a series of integers as a set of rows

In my last post, Cumulative row count over date ranges, I take advantage of a postgres function called generate_series(start, stop) - this generates a series of values, from start to stop. So, for example, generate_series(2,4) returns:

SELECT * FROM generate_series(1,4);

A question that’s been on my mind is how to achieve this in other database systems?


Oracle has a way to generate a series as well, but it’s just not quite as simple as calling a function:

SELECT level i
FROM dual
CONNECT BY level <= 4

Less intuitive but it works nicely. What it’s doing is it uses the CONNECT BY clause on the dual table - this means the 1 column, 1 row dual table is being “connected” to itself recursively — this is called a hierarchical query ( “Level” is a special pseudocolumn that tells you the level of recursion depth. So, while this method is more complicated, it does work nicely!


The MySQL way….

I didn’t know how to achieve generate_series in MySQL off the top of my head, so off to google! Unfortunately the method I found depends on selecting from a table of at least n size, where n is the number of values you wish to generate. Join on a subquery that declares a variable @num, and in your SELECT clause, increment this value per row. I found this solution here, they are using dates but it can be simplified to generate integers.

SELECT @num:=@num+1

| @num:=@num+1 |
|            1 |
|            2 |
|            3 |
|            4 |

4 rows in set, 1 warning (0.02 sec)

Works great but with the caveat that the "dummy" table you select from, in this example INFORMATION_SCHEMA.COLUMNS, must have at least as many rows as you want values. A little scary, in my opinion, to use in a generalized way with this limitation.

MS SQL Server

Oracle-like recursion can be done in SQL Server as well, and wouldn't you know it, some nice developer wrote a generate_series function for SQL Server and shared it on github

Why is generating a series useful, anyway?

There are probably more use cases for generating a series than I can come up with! But as in the previous post I linked to at the top of this one, a series can be useful for implementing something like a for loop. Join against the series as if it were another table, and use its value (call it i) as a parameter to a function, then voila, suddenly you're calling function f for each value of i: f(1), f(2), ..., f(n). This is a great way to create more elegant queries by avoiding copied/pasted repetition. (Again as see in my last post.) 

That's all for now!

Saturday, June 04, 2016

Cumulative row count over date ranges

I recently helped a coworker (Mr. M) with his SQL query that gets a cumulative count over a set of ranges of dates. He wrote a query that functionally works just fine, but he didn't like it because there was a lot of repetition happening. It looked something like:


Imagine in place of [datex] there is some mathy date expression -- I'm leaving it out for now (because I'm lazy) but the idea was subtract a week from today for date1, subtract 14 days from today for date2, etc.

So Mr. M did not like that his copy/pasted query parts violated DRY - Don't Repeat Yourself. I think that's less bad than all this UNIONing and repeated checking against CREATED_ON. It's running 5 different queries that are kind of redundant as they are looking up by the same column repeatedly. How can we improve this query? My first instinct was to use windowing functions, but alas, this is MySQL which lacks them (remember, MySQL is a hair band). I'm sure we can do better, but here's what I came up with as a first hack:

SELECT SUM(bucket1) AS week1,
 SUM(bucket2) AS week2,
 SUM(bucket3) AS week3,
 SUM(bucket4) AS week4,
 SUM(bucket5) as week5
 SELECT CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(0*7) DAY) THEN 1 ELSE 0 END AS bucket1,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(1*7) DAY) THEN 1 ELSE 0 END AS bucket2,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(2*7) DAY) THEN 1 ELSE 0 END AS bucket3,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(3*7) DAY) THEN 1 ELSE 0 END AS bucket4,
 CASE WHEN CREATED_ON <= ADDDATE(current_date(), INTERVAL -1-(4*7) DAY) THEN 1 ELSE 0 END AS bucket5

Well, I wasn't able to remove copy/pasted repetition, but I do think removing the UNION ALLs is a positive. What does this query do?

Step 1) The subquery. This is the bulk of the query, which passes over the whole table just once, and puts each row in an bucket from 1 through 5. It will perform 1 scan per row and 5 comparisons per row, vs. scanning each row 5 scans per row and 5 comparison per row.

Step 2) Sum them up!

Now in this specific instance the performance gain was pretty negligible, and we didn't really neaten the query all that much. I wouldn't blame Mr. M if he decides to stick with old query. As far as performance goes, the issue is this table was very small with only about 50k rows, therefore the cost of repeating 5 queries was negligible compared to all the added logic of those CASE WHEN expressions that I added.

Is there a way to avoid the repetition? 
The answer is, for MySQL, I don't know! But I wonder if I can figure something out with Postgres...

So first I made a sample table MYTABLE with 100,001 rows of randomly generated data.

To write the query, I decided to define the 5 dates using a common table expression, this way we can reference it as an imaginary table, so to speak:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks;

 2016-06-04 13:14:00.560272-04
 2016-05-28 13:14:00.560272-04
 2016-05-21 13:14:00.560272-04
 2016-05-14 13:14:00.560272-04
 2016-05-07 13:14:00.560272-04
(5 rows)

Next, let's join that to our table so that for each date_point, get the rows where created_on is less than or equal to the date_point...

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT * FROM weeks JOIN mytable ON mytable.created_on <= weeks.date_point;

          date_point           | created_on 
 2016-06-04 13:15:40.869402-04 | 2015-12-12
 2016-06-04 13:15:40.869402-04 | 2016-02-19
 2016-06-04 13:15:40.869402-04 | 2016-04-20
 2016-06-04 13:15:40.869402-04 | 2016-02-01
[snipping tons of rows]

Final step: get the counts per date_point:

mwrynn=# WITH weeks AS (SELECT now() - i * interval '1 week' AS date_point from generate_series(0,4) i)
SELECT COUNT(*), date_point
     mytable ON mytable.created_on <= weeks.date_point
GROUP BY date_point
ORDER BY date_point DESC;

 count  |          date_point          
 100001 | 2016-06-04 13:17:29.33812-04
  98232 | 2016-05-28 13:17:29.33812-04
  96290 | 2016-05-21 13:17:29.33812-04
  94313 | 2016-05-14 13:17:29.33812-04
  92350 | 2016-05-07 13:17:29.33812-04
(5 rows)

TA-DA!!! WE DID IT! We got the cumulative counts without copying/pasting several queries or several conditions - we just had to define the weeks we concerned about (weeks 0 through 4) in the call to generate_series(), put that in a CTE, and query that with a join... [Edit: Note, we should even be able to parameterize the number of weeks by doing something like generate_series(0,?) -- this way we could easily use any number of weeks.] My initial instinct of requiring window functions was wrong. But a big THANK YOU TO COMMON TABLE EXPRESSIONS. :) 

Question: Does anyone know how to do this in MySQL??

Day-After Edit: I guess a similar technique could be used in MySQL, only we must replace the CTE with a subquery, and replace generate_series with some way of selecting 0-4.... I guess this would work: (SELECT 0 UNION SELECT 1 UNION SELECT 2 UNION SELECT 3 UNION SELECT 4) - but perhaps this would count as the type of repetition we want to avoid...Another edit: apparently there is a solution, albeit an ugly one:

Saturday, May 07, 2016

Postgres queries using Window Functions, Part Deux

Since my post Postgres query using window functions, recursive query is my most popular of all-time (maybe not saying so much for my little blog :)), I thought I would write more on the subject. How about some useful examples?

A Cumulative Sum

First time I saw this my mind was blown - I didn't realize SQL could be so flexible! Maybe I have a table that represents sales per day:

mwrynn=# create table sales (sales_date date, dept varchar, total_sales numeric(10,2));

/* after some inserts which I'll skip, we have 4 days per dept (maybe it's a department store database) */
mwrynn=# select * from sales;
 sales_date | dept  | total_sales 
 2016-01-01 | men   |      100.00
 2016-01-01 | women |      140.00
 2016-01-02 | men   |       85.00
 2016-01-02 | women |       70.00
 2016-01-03 | men   |      135.00
 2016-01-03 | women |      135.00
 2016-01-04 | men   |      200.00
 2016-01-04 | women |      240.00
(8 rows)

/* now let's get a cumulative sum for each dept per day */
mwrynn=# select sales_date, dept, total_sales, sum(total_sales) over (partition by dept order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from sales;
 sales_date | dept  | total_sales |  sum   
 2016-01-01 | men   |      100.00 | 100.00
 2016-01-02 | men   |       85.00 | 185.00
 2016-01-03 | men   |      135.00 | 320.00
 2016-01-04 | men   |      200.00 | 520.00
 2016-01-01 | women |      140.00 | 140.00
 2016-01-02 | women |       70.00 | 210.00
 2016-01-03 | women |      135.00 | 345.00
 2016-01-04 | women |      240.00 | 585.00
(8 rows)

So simple! But the syntax is a little unintuitive, at least to me. Let's break down the key parts.

sum(total_sales): this is just your basic aggregate function sum()
over (): this means to apply the sum function OVER specific subsets, or partitions, of our data.
partition by: defines the partitions. Here we are telling Postgres, "split up the data by dept, aka into men and women, and apply our sum function to each as a completely separate subset (partition)"
order by: sort each partition by this clause, sales_date in our case.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: means to apply the sum for the current row and all preceding rows. This is known as the windowing frame. This is the key to achieving a cumulative sum - for each row, we want all the previous rows + the current row, but not the subsequent rows...

We can also run a simpler version of the query, taking the whole dataset as one partition instead of splitting by men and women:

 mwrynn=# select sales_date, dept, total_sales, sum(total_sales) over (order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from sales;
 sales_date | dept  | total_sales |   sum   
 2016-01-01 | men   |      100.00 |  100.00
 2016-01-01 | women |      140.00 |  240.00
 2016-01-02 | men   |       85.00 |  325.00
 2016-01-02 | women |       70.00 |  395.00
 2016-01-03 | men   |      135.00 |  530.00
 2016-01-03 | women |      135.00 |  665.00
 2016-01-04 | men   |      200.00 |  865.00
 2016-01-04 | women |      240.00 | 1105.00
(8 rows)

The rank() Function

We can riff off this same table/query and ask Postgres another query -- give me the rank of sales per dept, lowest to highest. I want to see on which date I sold the least, the most and everything in between. The key is in the rank() function...

mwrynn=# select sales_date, dept, total_sales, rank() over (partition by dept order by total_sales) from sales;
 sales_date | dept  | total_sales | rank 
 2016-01-02 | men   |       85.00 |    1
 2016-01-01 | men   |      100.00 |    2
 2016-01-03 | men   |      135.00 |    3
 2016-01-04 | men   |      200.00 |    4
 2016-01-02 | women |       70.00 |    1
 2016-01-03 | women |      135.00 |    2
 2016-01-01 | women |      140.00 |    3
 2016-01-04 | women |      240.00 |    4
(8 rows)

The ntile() Function

This function is called ntile(), perhaps as in a generic version of "quartile" or "percentile", in which you can specify any number. Let's say I want to put the above sales into buckets. I only have a small amount of data so we'll use a bucket number of 2 -- this allows us to bucket each row into either the lesser half of sales or the greater half.

mwrynn=# select sales_date, dept, total_sales, ntile(2) over (partition by dept order by total_sales) from sales;
 sales_date | dept  | total_sales | ntile 
 2016-01-02 | men   |       85.00 |     1
 2016-01-01 | men   |      100.00 |     1
 2016-01-03 | men   |      135.00 |     2
 2016-01-04 | men   |      200.00 |     2
 2016-01-02 | women |       70.00 |     1
 2016-01-03 | women |      135.00 |     1
 2016-01-01 | women |      140.00 |     2
 2016-01-04 | women |      240.00 |     2
(8 rows)

Simple enough, right? So again we've partitioned by the mens vs. women dept, and we're getting the lesser half of sales (bucket 1) vs. the greater half of sales (bucket 2).

Wrapping Up

Well that's enough of window functions for now. We've only begun to scratch the surface, so I may come back and add to this post. If you have any questions or suggestions, let me know!

Saturday, April 30, 2016

Comparing adding a column with default values between MySQL and Oracle, or, "Adding a new column is sloooow :("

In this post, I want to talk a little about something that seems to baffle some developers from time to time. It baffles us in the sense of, "I can't believe this simple operation is slow, what the hell is going on?" Happened today in fact with MySQL. That simple operation is nothing more than adding a column to a table! At least, adding column to a table that's full of data, perhaps millions of rows. Why should a column take so long to add? Well, it doesn't ALWAYS - it depends on whether your database has optimized column adding, and how much data your table has. Let's look into it a little...


In MySQL land, here's what happens:
mysql> CREATE TABLE t (a INT, b INT); 
Query OK, 0 rows affected (0.03 sec)
mysql> insert into t SELECT rand(), rand() FROM INFORMATION_SCHEMA.COLUMNS c1 
Query OK, 3825936 rows affected, 2 warnings (10.99 sec)
Records: 3825936  Duplicates: 0  Warnings: 2
Query OK, 3825936 rows affected (12.77 sec)
Records: 3825936  Duplicates: 0  Warnings: 0
12.8 seconds might not seem so bad, but what if this were a busy production table? What if there were 10 times as many rows? Quickly, just to prove that the time does increase linearly with number of rows, let's double the table size then try again:
mysql> INSERT INTO t SELECT * FROM t; Query OK, 3825936 rows affected (14.38 sec) Records: 3825936 Duplicates: 0 Warnings: 0 mysql> ALTER TABLE t ADD COLUMN d INT DEFAULT NULL; Query OK, 7651872 rows affected (25.80 sec) Records: 7651872 Duplicates: 0 Warnings: 0
Yup, double the size, double the time. [Note: MySQL version is 5.5 -- in this version an alter table like the above will recreate the entire table!]


In Oracle land, as usual they're ahead of the curve and have a solution for version 12c. Let's quote Mr. Kyte's description of this feature from
More Online Operations: Better Column Addition. In Oracle Database 11g you were able to perform a fast add of a column to a table if it had a default value and was defined as NOT NULL. (Arup Nanda has written about this at However, if you attempted to add a column with a default value and that column permitted null values, the ADD COLUMN operation could take a significant amount of time, generate a large amount of undo and redo, and lock the entire table for the duration of the operation. In Oracle Database 12c, that time, volume, and locking are no longer part of the process.
To demonstrate this, I copy ALL_OBJECTS into a table and measure its space—in blocks and bytes—using the show_space utility, posted on
SQL> create table t
  2  as
  3  select *
  4    from all_objects;
Table created.

SQL> exec show_space('T')
Full Blocks        ....        1,437
Total Blocks...........        1,536
Total Bytes............   12,582,912
Total MBytes...........           12

PL/SQL procedure successfully completed.

Now I add a column to table T, and this column will have a large default value. Because the column I’m adding is a CHAR(2000), it will always consume the full 2,000 bytes, given that the CHAR type is always blank-padded and fixed-width. Table T has more than 87,000 records, so adding a column would typically take a significant amount of time, but as you can see, the addition is practically instantaneous in Oracle Database 12c
SQL> set timing on
SQL> alter table t 
add (data char(2000) default 'x');
Table altered.
Elapsed: 00:00:00.07

I perform the identical operation in Oracle Database 11g and observe the following timing: 
SQL> set timing on
SQL> alter table t 
add (data char(2000) default 'x');
Table altered.
Elapsed: 00:00:28.59

Clearly, that’s a significant difference in runtimes.
What magic is going on under Oracle's hood? How can it get away with a near instant column addition while MySQL (and others) need to write the new default value n times. I don't precisely know what Oracle is doing, but I can offer a conjecture. What would I do to solve the problem? I would store the default once as a property of the table, and any time the table is queried, if the column value is missing, just kind of inject the default (which we have ready to go in memory at the time) in place. I found some insight in this article by Mohamed Houri at Oracle:> alter table t1 add C_DDL number default 42 not null;

Table altered.
Elapsed: 00:00:00.04
The C_DDL column has been added instantaneously in 11gR2 database while it took almost 49 seconds in 10gR2. What is this new mechanism that allows such an extremely rapid execution time when adding a not null column with default value to an existing table?
How could 3 million of rows be updated in 4 milliseconds?
Let’s verify visually if the update has been really done (from now and on when the Oracle version is not specified it will then refer to
SQL> select count(1) from t1;


SQL> select count(1) from t1 where c_ddl = 42;


Although Oracle has altered the t1 table instantaneously, the query is showing that the whole bunch of C_DDL column has been updated with their default value set to 42. How could this be possible? Will the execution plan be of any help here?
SQL> select * from table(dbms_xplan.display_cursor);

| Id | Operation        | Name | Rows | Bytes | Cost (%CPU)| Time          |
| 0 | SELECT STATEMENT  |      |      |       | 3016 (100) |               |
| 1 | SORT AGGREGATE    |      |    1 |     3 |            |               |
|*2 | TABLE ACCESS FULL | T1   | 2999K|  8788K| 3016   (5) | 00:00:10      |

 Predicate Information (identified by operation id):
     2 - filter(NVL("C_DDL",42)=42) 

Notice here again how the predicate part of the above execution plan can reveal vital information when trying to understand what is happening behind the scene. Despite I haven’t used the NVL function in my query, this one appears in the predicate part indicating that, internally, Oracle is still considering the C_DDL column as to be potentially able to contain null values (which means it has not been updated) and, as such, Oracle is replacing it with its default value 42.
So there you have it- it does pretty much what I guessed. But that's an 11g example that requires a NOT NULL constraint.

So what about when the default is 42 but NULLs are allowed? Can it optimize this situation? According to the Tom Kyte quote above, the answer appears to be yes. But how does it distinguish a "real" null with simply a non-set value due to the ADD COLUMN optimization? I don't have the answer, but perhaps Oracle writes an explicit NULL when you do something like "UPDATE t SET data=NULL" as opposed to when it is simply not set. NULL is different from missing. The Mohamed Houri article continues, with respect to default values on columns that allow NULLs:> alter table t1 add C_DDL_2 number default 84;

Table altered.

Elapsed: 00:00:58.25

12c> alter table t1 add C_DDL_2 number default 84;

Elapsed: 00:00:00.02

While adding the nullable C_DDL_2 column took 58 seconds to be honored in 11gR2 it has been instantaneously done in 12c.
This is a clear demonstration that in Oracle Database 12c, DDL optimization has been extended to include null columns having default values. Indeed, when you query t1 table to get the distinct values of the newly added column (C_DDL_2) you will realize that the entire table rows have seen their metadata (default value 84) updated as shown via the following query:
12c> select c_ddl_2, count(1) from t1 group by c_ddl_2;

   C_DDL_2    COUNT(1)
   -------   ----------
    84        3000000

SQL> select count(1) from t1 where c_ddl_2=84;


SQL> select * from table(dbms_xplan.display_cursor);

| Id | Operation        | Name | Rows | Bytes | Cost (%CPU)| Time         |
| 0 | SELECT STATEMENT  |      |      |       | 3803 (100) |              |
| 1 | SORT AGGREGATE    |      |    1 |    13 |            |              |
|* 2| TABLE ACCESS FULL | T1   | 3538K|    43M|   3803 (1) |  00:00:01    |

Predicate Information (identified by operation id):
    2 - filter(DECODE(TO_CHAR(SYS_OP_VECBIT("SYS_NC00006$",0)),NULL,NVL("
[Mwrynn: WHOA, what is this funkiness??]
- dynamic statistics used: dynamic sampling (level=2) 

However, in order to ensure DDL optimization for null columns with default value, things became more complex than it used to be for not null columns in the preceding release. We went from a simple implicit use of the NVL function to a complex and exotic predicate part involving SYS_OP_VECBIT Oracle non documented function and a new internal column SYS_NC00006$ in order to honor the default value since this one has not been physically updated.

One more thought courtesy of my buddy Kevin. What happens if you change the default? So you have:

  • At time=0, table t has columns a,b with a million rows
  • At time=1, table t has columns a,b,c with c defaulting to 42. Therefore the million new c fields are set to 42 (though optimized)
  • At time=2, table t column c is changed to default to 41. Now the optimization would be broken, since all the values from time=1 should have 42. Does Oracle at this point decide to "fill in" all the 42s for existing rows?? If that's the case, this step should be slow. I think this calls for an experiment! Stay tuned...