A Cumulative Sum
First time I saw this my mind was blown - I didn't realize SQL could be so flexible! Maybe I have a table that represents sales per day:
mwrynn=# create table sales (sales_date date, dept varchar, total_sales numeric(10,2));
/* after some inserts which I'll skip, we have 4 days per dept (maybe it's a department store database) */
mwrynn=# select * from sales;
sales_date | dept | total_sales
------------+-------+-------------
2016-01-01 | men | 100.00
2016-01-01 | women | 140.00
2016-01-02 | men | 85.00
2016-01-02 | women | 70.00
2016-01-03 | men | 135.00
2016-01-03 | women | 135.00
2016-01-04 | men | 200.00
2016-01-04 | women | 240.00
(8 rows)
/* now let's get a cumulative sum for each dept per day */
mwrynn=# select sales_date, dept, total_sales, sum(total_sales) over (partition by dept order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from sales;
sales_date | dept | total_sales | sum
------------+-------+-------------+--------
2016-01-01 | men | 100.00 | 100.00
2016-01-02 | men | 85.00 | 185.00
2016-01-03 | men | 135.00 | 320.00
2016-01-04 | men | 200.00 | 520.00
2016-01-01 | women | 140.00 | 140.00
2016-01-02 | women | 70.00 | 210.00
2016-01-03 | women | 135.00 | 345.00
2016-01-04 | women | 240.00 | 585.00
(8 rows)
So simple! But the syntax is a little unintuitive, at least to me. Let's break down the key parts.
sum(total_sales): this is just your basic aggregate function sum()
over (): this means to apply the sum function OVER specific subsets, or partitions, of our data.
partition by: defines the partitions. Here we are telling Postgres, "split up the data by dept, aka into men and women, and apply our sum function to each as a completely separate subset (partition)"
order by: sort each partition by this clause, sales_date in our case.
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: means to apply the sum for the current row and all preceding rows. This is known as the windowing frame. This is the key to achieving a cumulative sum - for each row, we want all the previous rows + the current row, but not the subsequent rows...
We can also run a simpler version of the query, taking the whole dataset as one partition instead of splitting by men and women:
mwrynn=# select sales_date, dept, total_sales, sum(total_sales) over (order by sales_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) from sales;
sales_date | dept | total_sales | sum
------------+-------+-------------+---------
2016-01-01 | men | 100.00 | 100.00
2016-01-01 | women | 140.00 | 240.00
2016-01-02 | men | 85.00 | 325.00
2016-01-02 | women | 70.00 | 395.00
2016-01-03 | men | 135.00 | 530.00
2016-01-03 | women | 135.00 | 665.00
2016-01-04 | men | 200.00 | 865.00
2016-01-04 | women | 240.00 | 1105.00
(8 rows)
The rank() Function
We can riff off this same table/query and ask Postgres another query -- give me the rank of sales per dept, lowest to highest. I want to see on which date I sold the least, the most and everything in between. The key is in the rank() function...
mwrynn=# select sales_date, dept, total_sales, rank() over (partition by dept order by total_sales) from sales;
sales_date | dept | total_sales | rank
------------+-------+-------------+------
2016-01-02 | men | 85.00 | 1
2016-01-01 | men | 100.00 | 2
2016-01-03 | men | 135.00 | 3
2016-01-04 | men | 200.00 | 4
2016-01-02 | women | 70.00 | 1
2016-01-03 | women | 135.00 | 2
2016-01-01 | women | 140.00 | 3
2016-01-04 | women | 240.00 | 4
(8 rows)
The ntile() Function
This function is called ntile(), perhaps as in a generic version of "quartile" or "percentile", in which you can specify any number. Let's say I want to put the above sales into buckets. I only have a small amount of data so we'll use a bucket number of 2 -- this allows us to bucket each row into either the lesser half of sales or the greater half.
mwrynn=# select sales_date, dept, total_sales, ntile(2) over (partition by dept order by total_sales) from sales;
sales_date | dept | total_sales | ntile
------------+-------+-------------+-------
2016-01-02 | men | 85.00 | 1
2016-01-01 | men | 100.00 | 1
2016-01-03 | men | 135.00 | 2
2016-01-04 | men | 200.00 | 2
2016-01-02 | women | 70.00 | 1
2016-01-03 | women | 135.00 | 1
2016-01-01 | women | 140.00 | 2
2016-01-04 | women | 240.00 | 2
(8 rows)
Simple enough, right? So again we've partitioned by the mens vs. women dept, and we're getting the lesser half of sales (bucket 1) vs. the greater half of sales (bucket 2).
Wrapping Up
Well that's enough of window functions for now. We've only begun to scratch the surface, so I may come back and add to this post. If you have any questions or suggestions, let me know!
No comments:
Post a Comment