Wednesday, August 10, 2011

Postgres query using window functions, recursive query

As Tom Kyte says, "Analytics rock. Analytics roll." Among other things, Oracle's analytic queries allow a row in the resultset to reference other rows in the same resultset. This is tremendously powerful stuff. If you're not familiar with analytics, take a look at some of the magic Tom performs in that thread I just linked. Some simple use cases:
1) Get a cumulative sum - maybe you have a table of individual purchases and want to see how the sum of purchases progresses over time.
2) Perform your typical GROUP BY query using an aggregate function, BUT you don't want to roll up the data. You want to see every row in the table you're querying, with the AVG (or MAX, SUM, whatever) alongside it. See the first example in this Postgres doc.

But did you know Postgres now has Analytics too, as of 8.4? They are called Window Functions in the Postgres world. (I suppose you could call them "queries that use window functions", or even "analytic queries" as Oracle can't force you not to use the phrase.)

I had a requirement. Stripped down to the core, the requirement was to translate this:

ID | expire
---+---------
 7 |  8/20/2011
 6 | 
 5 |  8/15/2011
 4 |  8/15/2011
 3 |
 2 |
 1 |  8/10/2011

To this:

 ID | expire
---+---------
 7 |  8/20/2011
 6 |  8/15/2011
 5 |  8/15/2011
 4 |  8/15/2011
 3 |  8/10/2011
 2 |  8/10/2011
 1 |  8/10/2011

The translation is: If expire is null, look down to the next older expire. If THAT one is null too, look down again, and so on. The solution that worked very well for me was to use Window Functions, plus recursive ("WITH RECURSIVE") queries.

Edit: A helpful commenter provided this simpler solution: select
 id, max(expire) over (order by id rows between unbounded preceding and current row) as expire
 from t 
order by id desc

The windowing clause in this query says: order the rows by id ascending, and for each row, get the max value of expire looking backwards from this row to all previous rows. Pretty simple! The Oracle solution should be the same, btw.

I'll leave my older code and thought processes in this article for posterity.

Step 1 was to use Window Functions to structure a view of the table above, such that the previous and next IDs could be found in each row, almost like a linked list data structure:

ID   prev_ID  next_id  expire
---  -------  -------  --------
 7     null      6     8/20/2011
 6      7        5
 5      6        4     8/15/2011
 4      5        3     8/15/2011
 3      4        2
 2      3        1
 1      2       null   8/10/2011
(In the real world, my IDs aren't always such simple, gap-free numbers, so I cannot use any simple math tricks, such as finding prev_ID by subtracting 1 from ID.)

Step 2 was to use WITH RECURSIVE to traverse from row to row, via the pointers in our linked list, to form the end resultset we're looking for. This entailed another little trick. Coalesce two expires (the current and the parent) to make the non-null values bubble up.

I'll post the SQL when I get to it.
Creating the view, transaction, with window functions:
CREATE VIEW transaction AS
select
id,
lag(id) over (partition by account_id order by id asc) lag_id,
lead(id) over (partition by account_id order by id asc) lead_id,
expire
from trans;


Recursive query against transaction:
WITH RECURSIVE recur(
id,
expire
) AS (
select
id,
expire
from transaction where lag_id is null
union all
select
transaction.id,
coalesce(transaction.expire, recur.expire)
from transaction, recur
where transaction.lag_id=recur.id
)
select * from recur;

I'm really pleased with the advanced querying features Postgres has been putting out. Actually, WITH RECURSIVE is ANSI standard, as it turns out, and you can even find it in SQL Server, DB2 and Oracle 11g (though Oracle has had its own way of performing recursive queries using CONNECT BY, for ages).

3 comments:

MWrynn said...

This is probably obvious, but note that if you don't index both the columns in the recursive join (in my example, "where transaction.lag_id=recur.id"), the recursive query may run very slow! In my tests, once I got into 10s of thousands of rows it started becoming quite slow. (Probably something like n^2 performance...)

RobJ said...

Here's a way to get the answer with a window function, without recursion:

select
id, max(expire) over (order by id rows between unbounded preceding and current row) as expire
from t
order by id desc

MWrynn said...

Thanks RobJ - didn't know about that one. Nice!