Saturday, March 06, 2010

Partitioned Aggregates ...or... I love "Distinct On"

Let's begin by creating a table dealing with loan data:

postgres=# create table loan (id serial primary key, type varchar(1), issue_date date, issue_amount numeric);
NOTICE: CREATE TABLE will create implicit sequence "loan_id_seq" for serial column "loan.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "loan_pkey" for table "loan"
CREATE TABLE

Create some test data in an appropriately-named file, testdata:

A|20090405|10000
A|20100101|50000
A|20070605|40000
B|19850815|100000
B|19900430|10000
C|20100304|5555
D|20011122|512591
D|20080517|250259
D|20100206|22222
D|20100204|99999

Load it up...

postgres=# \copy loan (type, issue_date, issue_amount) from 'testdata' delimiter '|'

postgres=# select * from loan;

id | type | issue_date | issue_amount
----+------+------------+--------------
1 | A | 2009-04-05 | 10000
2 | A | 2010-01-01 | 50000
3 | A | 2007-06-05 | 40000
4 | B | 1985-08-15 | 100000
5 | B | 1990-04-30 | 10000
6 | C | 2010-03-04 | 5555
7 | D | 2001-11-22 | 512591
8 | D | 2008-05-17 | 250259
9 | D | 2010-02-06 | 22222
10 | D | 2010-02-04 | 99999
(10 rows)

Now, here's a fairly common query I have to run. I'd like to get the most recent loan per type. That is, I want to tell postgres: For each type value (A,B,C,D), get me the row with the greatest issue_date.

How to do this? We obviously need to use max(issue_date) somewhere. The rest is a little bit more esoteric.

We're going to solve this problem using the unique-to-postgres DISTINCT ON feature. DISTINCT ON (col) gives you one row for each value of col. For example:

postgres=# select distinct on (type) * from loan;

id | type | issue_date | issue_amount
----+------+------------+--------------
1 | A | 2009-04-05 | 10000
4 | B | 1985-08-15 | 100000
6 | C | 2010-03-04 | 5555
7 | D | 2001-11-22 | 512591
(4 rows)

Ta-da! But wait, these aren't the most recent rows per type. An immediately obvious problem is that for type A, the row with id=2 should be returned. Postgres thought: For each type, return the first row I see. How do we get it to return the most recent only?

The answer lies in the good ol' ORDER BY clause. If we sort appropriately, the first row Postgres sees of each type will be the most recent:

postgres=# select distinct on (type) * from loan order by type, issue_date desc;
id | type | issue_date | issue_amount
----+------+------------+--------------
2 | A | 2010-01-01 | 50000
5 | B | 1990-04-30 | 10000
6 | C | 2010-03-04 | 5555
9 | D | 2010-02-06 | 22222
(4 rows)


Now I can really say Ta-da... This was something that was not very intuitive to me, but has proved very useful to know.

How would I solve this problem using Oracle? I would probably use a so-called Analytic query. Like most things Oracle, this feature is powerful but complex. It's more applicable to a wider set of problems, but for this specific problem it seems a little bit complicated.

Unfortunately I don't have an Oracle install handy at the moment, but it should be something like this:

select *
from (select id, type, issue_date, issue_amount,
row_number() over (partition by type order by issue_date desc) rn
from loan)
where rn = 1;

What we're doing above is telling Oracle to make a row number for each row in the table, partitioned by type. Conceptually it's doing something like this:

id|type|...|row_number
--+----+---+----------
1 |A |...|1
2 |A |...|2
3 |A |...|3
4 |B |...|1
5 |B |...|2
6 |C |...|1
7 |D |...|1
8 |D |...|2
9 |D |...|3
10|D |...|4

So above we see a row number assigned to each row, where the row number is reset to 1 for each type.

There is more to it. In the PARTITION BY clause we also have "order by issue_date desc" - this means for each type, row_number 1 is assigned to the row with the greatest issue_date. If we treat this whole query as a subquery, we can simply select * from it where row_number=1. The resultset should be the same as with Postgresql's DISTINCT ON:

 id | type | issue_date | issue_amount
----+------+------------+--------------
  2 | A    | 2010-01-01 |        50000
  5 | B    | 1990-04-30 |        10000
  6 | C    | 2010-03-04 |         5555
  9 | D    | 2010-02-06 |        22222
(4 rows)