Monday, March 31, 2014

"In Which We Are Not Having Fun"

Speaking of Postgres materialized views...The guys at sparkfun (makers of Arduino as well other fine products) migrated from MySQL to Postgres for various reasons, and decided to use the newish materialized views. Apparently it was all fun and games until their number of orders increased by an order of magnitude, bringing MV refreshes to their knees. (These MVs keep track of their available stock, it seems.) 

MVs in Postgres are very basic -- no incremental refreshes like Oracle has, so they're crunching all the numbers from square one every single time you say "refresh."

I don't know anything about the sparkfun guys' particular requirements or infrastructure, but I probably would not try to keep Postgres MVs of any complexity constantly up to date in a transactional system. I'm imagining they're trying to update the MVs any time the dependent tables are modified, so too many orders kill them. What if they accepted some data staleness and set the MVs to refresh on a timer (every x minutes)? If the view of stock is out of date by a bit, that might be OK as long as there's a check in place at checkout time. Just kind of wondering out loud here.

This situation shows that testing to scale is necessary, by the way.

On a related note, their reasons for switching from MySQL to Postgres are pretty numerous.

MariaDB 10 Release

Saw this on Slashdot this morning, after driving through a March 31 snowstorm. Replication slaves don't crash now...What a feature! Guess that was a problem. I should check out Maria one of these days.

Tuesday, March 18, 2014

PostgreSQL Materialized View Example

As I mentioned in an earlier post, PostgreSQL finally supports some very basic materialized views as of version 9.3, and this has me more excited than it should! I finally got a chance to give it a try.

Overview of Materialized Views

But first, what is a materialized view? Very briefly, it’s a view whose query results are stored. If a view is a stored query that spits results at you on demand, a materialized view is a view that stores the query results and spits those same results out at you on demand. The ordinary view works hard to get your results every time you query it, while the materialized view has already done the work for you in the past. View=slower, zero disk space; materialized view=faster; potentially lots of disk space.

If the query results change over time, then at some point the materialized view results will be “stale”. You must tell Postgres to refresh your materialized view from time to time. In more advanced databases, materialized views may refresh automatically, either when a dependent table is updated, on a regular schedule, or on demand. Postgres only supplies “on demand”.

A Simple Example: Baseball Stats

Let’s get right into an example. I like baseball, so the goal of this example will be to come up with team batting averages based on the players’ individual plate appearances. My data is very simple, and a plate appearance can either be a hit or an out.  

So, we have these tables:
 /* each row represents a player on a team */  
 CREATE TABLE player (id SERIAL PRIMARY KEY, name VARCHAR, num SMALLINT, team VARCHAR(3));  
 /* each row represents a plate appearance where a player attempts to hit the ball. In our simplistic model, there are two possible outcomes: he either gets a hit or he is out */  
 CREATE TABLE pa (pa_id SERIAL PRIMARY KEY, player_id INT REFERENCES player(id), gamedate DATE, inning SMALLINT, result VARCHAR(1));  

I have populated them with dummy data. The teams I used are the MLB American League East teams: New York Yankees (NYY), Boston Red Sox (BOS), Tampa Bay Rays (TB), Toronto Blue Jays (TOR) and Baltimore Orioles (BAL). 25 players for each team, getting 4 plate appearances per day over 180 days. These are just numbers I made up. I’m not attempting total realism here.

Now I will write a view to get the batting averages per team:
 CREATE OR REPLACE VIEW team_avg_v AS SELECT team, SUM(CASE WHEN result='H' THEN 1 ELSE 0 END)::float/COUNT(*)::float AS avg FROM pa JOIN player p ON pa.player_id=p.id GROUP BY team  

The result:
 mwrynn=# select * from team_avg_v;  
  team |    avg      
 ------+-------------------  
  BAL | 0.252296296296296  
  NYY | 0.244592592592593  
  TB  | 0.246666666666667  
  BOS | 0.250444444444444  
  TOR | 0.248444444444444  
 (5 rows)  
 Time: 248.679  

249 ms is good time, but imagine if we had data for *all* players for *all* teams, even for *all* years baseball was played! This simple view would possibly take hours to run, at least on my wimpy little laptop.

Enter the materialized view. All we have to do is the following to materialize this view: 

 mwrynn=# CREATE MATERIALIZED VIEW team_avg_mv AS SELECT * FROM team_avg_v;  
 SELECT 5  
 Time: 190.997 ms  

And now let’s try querying it:
 mwrynn=# select * from team_avg_mv;  
  team |    avg      
 ------+-------------------  
  BAL | 0.252296296296296  
  NYY | 0.244592592592593  
  TB  | 0.246666666666667  
  BOS | 0.250444444444444  
  TOR | 0.248444444444444  
 (5 rows)  
 Time: 0.755 ms  

We went from 249 ms to less than a ms. Not bad, huh? :)

Data Staleness and Refreshing

Probably the biggest downside to using a materialized view in this manner is it can get stale. In the context of our example, this means we add some more plate appearances for our players, and the averages change a bit. Let’s query the plain old view after adding more data:

 mwrynn=# select * from team_avg_v ;  
  team |    avg      
 ------+-------------------  
  BAL | 0.248888888888889  
  NYY | 0.24862962962963  
  TB  | 0.250592592592593  
  BOS | 0.25037037037037  
  TOR | 0.250074074074074  
 (5 rows)  
 Time: 351.739 ms  

Now let’s show that our materialized view is out of date:
 mwrynn=# select * from team_avg_mv ;  
  team |    avg      
 ------+-------------------  
  BAL | 0.252296296296296  
  NYY | 0.244592592592593  
  TB  | 0.246666666666667  
  BOS | 0.250444444444444  
  TOR | 0.248444444444444  
 (5 rows)  
 Time: 0.710 ms  

We need to refresh it to get it up to speed:
 mwrynn=# refresh materialized view team_avg_mv;  
 REFRESH MATERIALIZED VIEW  
 Time: 360.579 ms  

Our Exciting Conclusion

Note that the refresh took about the same amount of time as the plain old view took to query. That’s no coincidence - it’s doing the same work to get our query results (plus storing the data in the materialized view case, and if we had built any indexes on it, those would have to be updated accordingly). This could be a problem if the underlying data changes often, and it’s not acceptable to see slightly out-of-date data in our query results. So, like anything else, materialized views are a tool to use in the appropriate situation, not a one-size-fits-all magic bullet. Other databases such as Oracle have special incremental refresh modes that can speed up your refreshes - you can sort of let you have your cake and eat it too, but these advanced materialized views tend to get very complex. Maybe at some point I'll show you some of this fancy Oracle materialized view stuff. :)