Monday, February 19, 2018

Retrosheet/Oracle Project


I'm brushing up on my Oracle partitioning skills by loading up the Retrosheet baseball data into a table I'm calling "play" - it's called that because it has every bit of data down to the play level going back as far as 1921! (Except Retrosheet lacks data for years 1923-1924...Does anybody know if something happened in baseball in those two years?) Anyway it's not a HUGE amount of data - about 14 million rows - but I thought it would be a sizable enough data set to use for experimenting with partitions and perhaps materialized views.

My thought for partitioning was to partition by year using range/interval partitioning. This way I can tell Oracle to automatically create partitions as data is inserted. I created a few partitions manually as a baseline - I think I could've gotten away with just one. See the DDL below:

CREATE TABLE play ( 
  hometeam VARCHAR2(3),
  visteam VARCHAR2(3),
  gamedate DATE,
  gamenum NUMBER(2,0),
  inning NUMBER(2,0),
  batter VARCHAR2(8),
  pitcher VARCHAR2(8),
  result VARCHAR2(30),
  badj VARCHAR2(1)
)
PARTITION BY RANGE(gamedate)
INTERVAL(NUMTOYMINTERVAL(1, 'YEAR'))
  ( PARTITION p0 VALUES LESS THAN (TO_DATE('1-1-1922', 'DD-MM-YYYY')),
    PARTITION p1 VALUES LESS THAN (TO_DATE('1-1-1923', 'DD-MM-YYYY')),
    PARTITION p2 VALUES LESS THAN (TO_DATE('1-1-1924', 'DD-MM-YYYY')),
    PARTITION p3 VALUES LESS THAN (TO_DATE('1-1-1925', 'DD-MM-YYYY'))
  )
;

Now after inserting all the data (1921-2017), I can see in SQL Developer that it automatically created lots of extra partitions. Nice! My experience with other databases is that this would need to be done manually, so I appreciate this feature.

SQL Developer view of the partitions

I once read a quote from a big data company CEO saying that for the sport of basketball, you could not calculate player metrics in close to real time with an old fart relational database - you needed his super-duper, high-scale big data infrastructure to do anything like that. Now I don't know much about basketball, so for all I know he's right, but I think baseball would give us a close enough comparison...

So my next mission is to aggregate this data in the form of materialized views, and get some player metrics such as batting average per year and of all time. I'm going to try to take advantage of the FAST REFRESH feature of materialized views to calculate these metrics in close to real time. Imagine data trickling into the PLAY table, and almost instantly we can update the player's batting average, and even tell you the rank of his batting average not only for the year, but for all time. I'll see if it can be done!

No comments: