Thursday, March 31, 2016

MySQL, auto_increment and the magic number 0

Ran into a MySQL issue at work recently. My group created three static lookup tables, which we populated via hand-coded insert statements. For this kind of lookup table, we like having control over the primary key (`id`), so as a general rule we set this column explicitly rather than relying on auto_increment. For example, a table might look like the following (this is a made-up example, as I try to maintain separation of work and blog):

EMPLOYEE_TYPE (id, type)
0 Unknown
1 Manager
2 Grunt
3 Overlord

We create the table then insert like so:

create table EMPLOYEE_TYPE (id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, type VARCHAR(20));

insert into EMPLOYEE_TYPE (id, type) values (0, 'Unknown');
insert into EMPLOYEE_TYPE (id, type) values (1, 'Manager');
insert into EMPLOYEE_TYPE (id, type) values (2, 'Grunt');
insert into EMPLOYEE_TYPE (id, type) values (3, 'Overlord');

(We thought it would be a nice convention to use 0 to represent 'Unknown' across several tables.)

When we kicked off this script, it never made it beyond the second statement because:

mysql> insert into EMPLOYEE_TYPE (id, type) values (1, 'Manager');
ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'

MySQL first overrode our 0 value for 'Unknown', setting it to 1. Then we tried to insert 'Manager' with id 1 and BOOM! Primary Key violation. It turns out MySQL has an oddball special case for auto_increment columns. If you explicitly specify your id, it is used instead of the auto_increment. Fine. Perfect. UNLESS THAT NUMBER IS 0, IN WHICH CASE, 0 IS IGNORED AND THE AUTO_INCREMENT VALUE IS INSTEAD USED. WHAT THE--??

What is the meaning of this? From a page in the documentation:
"No value was specified for the AUTO_INCREMENT column, so MySQL assigned sequence numbers automatically. You can also explicitly assign 0 to the column to generate sequence numbers. If the column is declared NOT NULL, it is also possible to assign NULL to the column to generate sequence numbers. When you insert any other value into an AUTO_INCREMENT column, the column is set to that value and the sequence is reset so that the next automatically generated value follows sequentially from the largest column value."
OK, so at least it's documented, but wow, there are such odd design decisions in MySQL sometimes! Fortunately, they often provide configurable database parameters to disable them, and this strange case is no exception: adding NO_AUTO_VALUE_ON_ZERO to sql_mode disables this loony behavior, so that only NULL generates the next sequence number. Unfortunately, I find that hardly anybody bothers to change these kinds of parameters from the defaults, and the defaults tend to err on the crazy side.
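
The rule quoted above is easy to misread, so here is a toy Python model of it. This is only a sketch of the documented behavior, not MySQL itself: the `AutoIncrementTable` class and its `no_auto_value_on_zero` flag are made-up names for illustration, and the real server handles plenty of cases this ignores.

```python
# Toy model of the documented AUTO_INCREMENT rule. Not MySQL, and none of
# these names belong to any MySQL API; they are illustrative only.

class AutoIncrementTable:
    def __init__(self, no_auto_value_on_zero=False):
        self.rows = {}        # id -> type
        self.next_id = 1      # next automatically generated value
        self.no_auto_value_on_zero = no_auto_value_on_zero

    def insert(self, row_id, value):
        # NULL (None) always asks for a generated id; 0 does too, unless
        # the NO_AUTO_VALUE_ON_ZERO behavior is switched on.
        generate = row_id is None or (
            row_id == 0 and not self.no_auto_value_on_zero
        )
        if generate:
            row_id = self.next_id
        if row_id in self.rows:
            raise ValueError(f"Duplicate entry '{row_id}' for key 'PRIMARY'")
        self.rows[row_id] = value
        # The sequence continues after the largest id seen so far.
        self.next_id = max(self.next_id, row_id + 1)
        return row_id


# Default behavior: the explicit 0 is swapped for a generated id, so the
# following insert with id 1 collides.
default_mode = AutoIncrementTable()
first_id = default_mode.insert(0, "Unknown")
try:
    default_mode.insert(1, "Manager")
    collided = False
except ValueError:
    collided = True

# With the flag on, 0 is stored as given and the whole script goes through.
fixed_mode = AutoIncrementTable(no_auto_value_on_zero=True)
types = ["Unknown", "Manager", "Grunt", "Overlord"]
fixed_ids = [fixed_mode.insert(i, t) for i, t in enumerate(types)]
```

In the default case `first_id` comes back as 1 rather than 0, which is exactly the collision our insert script ran into.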

Saturday, March 26, 2016

JUST. BIND.

I saw this BBC article yesterday, "These unlucky people have names that break computers". It's about people with names such as Jennifer Null who have problems inputting their name into various systems because, you know, null.
"But to any programmer, it's painfully easy to see why 'Null' could cause problems for a database. This is because the word 'null' is often inserted into database fields to indicate that there is no data there. Now and again, system administrators have to try and fix the problem for people who are actually named 'Null' – but the issue is rare and sometimes surprisingly difficult to solve."
Poor Jennifer Null - always at the mercy of poor programming. But the ghost of my favorite Tom Kyte post always returns when I hear of this kind of problem. JUST BIND!!

I don't know if it's difficult to deal with in JavaScript or other parts of a modern system, but the database tier should have no problem. It doesn't matter if you have "NULL" in your name, it doesn't matter if your name is Bobby Tables or has any number of quotes, semicolons or reserved words. It doesn't matter if it contains the string DROP DATABASE or DELETE if...YOU. JUST. BIND.
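
Here is a minimal sketch of what binding looks like in practice. I'm using Python's built-in sqlite3 module purely because it runs anywhere; the `person` table and the sample names are made up, and other databases and drivers use their own placeholder syntax.

```python
import sqlite3

# In-memory database so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, last_name TEXT)")

# Hypothetical sample names that tend to upset string-concatenated SQL.
awkward_names = [
    "Null",
    "NULL",
    "O'Brien",
    "Robert'); DROP TABLE person;--",
]

# The ? placeholder binds each value as data. The SQL text never changes,
# so nothing inside a name is ever parsed as SQL.
conn.executemany(
    "INSERT INTO person (last_name) VALUES (?)",
    [(name,) for name in awkward_names],
)

# Every name comes back exactly as it went in.
stored = [
    row[0] for row in conn.execute("SELECT last_name FROM person ORDER BY id")
]

# The string 'Null' is not the SQL NULL, so no row has a missing name.
null_count = conn.execute(
    "SELECT COUNT(*) FROM person WHERE last_name IS NULL"
).fetchone()[0]
```

The `person` table survives, `stored` matches `awkward_names`, and `null_count` stays at zero, because a bound "Null" is just a four-letter string.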





Saturday, March 05, 2016

Dan Martensen's blog: SQL Performance of Join and Where Exists

Here's a neat little article on the blog of a fellow named Dan Martensen. I really enjoy performance experiments that show how one approach fares vs. another - it's something I do myself as well. :) This one shows how "where exists" performs vs. an inner join in Postgres 9.5.

Here is the article.

It turns out "where exists" is faster in his test because, as the explain plan reveals, the join performs a Merge Join step, while "where exists" can get away with a Merge Semi Join step. What do these steps mean? Let's turn to the documentation (Source):
  • merge join: Each relation is sorted on the join attributes before the join starts. Then the two relations are scanned in parallel, and matching rows are combined to form join rows. This kind of join is more attractive because each relation has to be scanned only once. The required sorting might be achieved either by an explicit sort step, or by scanning the relation in the proper order using an index on the join key. 
As for Merge Semi Join, I can't find such a clear definition in the docs, but Dan Martensen's article (that I linked to above) mentions: "It’s worth noting Where Exists doesn’t need to return a value to the parent query in the SELECT. It’s WHERE clause compares parent to subquery ids using a boolean short circuit early exit."

So maybe that's a clue as to what goes on under the hood: for each parent row, it just checks whether a matching row exists, then quits. No need to retrieve the matching rows or pair them up.
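
To see the difference in what the two queries are asked to produce, here is a small sketch. It uses SQLite through Python's sqlite3 module rather than Postgres, so it says nothing about Merge Join plans or timings; the `parent` and `child` tables are made up, and the only point is the semantics of a join versus an existence check.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY);
    CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER);
    INSERT INTO parent (id) VALUES (1), (2), (3);
    -- parent 1 has three children, parent 2 has one, parent 3 has none
    INSERT INTO child (parent_id) VALUES (1), (1), (1), (2);
""")

# An inner join pairs each parent with every matching child row,
# so a parent appears once per match.
join_ids = [row[0] for row in conn.execute("""
    SELECT p.id
    FROM parent p
    JOIN child c ON c.parent_id = p.id
    ORDER BY p.id
""")]

# WHERE EXISTS only asks whether at least one match exists,
# so each qualifying parent appears exactly once.
exists_ids = [row[0] for row in conn.execute("""
    SELECT p.id
    FROM parent p
    WHERE EXISTS (SELECT 1 FROM child c WHERE c.parent_id = p.id)
    ORDER BY p.id
""")]
```

Here `join_ids` is `[1, 1, 1, 2]` while `exists_ids` is `[1, 2]`. The join has to produce a row for every match, whereas the existence check is satisfied by the first one, which fits the "early exit" idea in the quote.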

Perhaps I'll check whether MySQL behaves similarly, as my work will soon involve more MySQL!