Monday, February 19, 2018

Retrosheet/Oracle Project


I'm brushing up on my Oracle partitioning skills by loading up the Retrosheet baseball data into a table I'm calling "play" - it's called that because it has every bit of data down to the play level going back as far as 1921! (Except Retrosheet lacks data for years 1923-1924...Does anybody know if something happened in baseball in those two years?) Anyway it's not a HUGE amount of data - about 14 million rows - but I thought it would be a sizable enough data set to use for experimenting with partitions and perhaps materialized views.

My thought for partitioning was to partition by year using range/interval partitioning. This way I can tell Oracle to automatically create partitions as data is inserted. I created a few partitions manually as a baseline - I think I could've gotten away with just one. See the DDL below:

CREATE TABLE play ( 
  hometeam VARCHAR2(3),
  visteam VARCHAR2(3),
  gamedate DATE,
  gamenum NUMBER(2,0),
  inning NUMBER(2,0),
  batter VARCHAR2(8),
  pitcher VARCHAR2(8),
  result VARCHAR2(30),
  badj VARCHAR2(1)
)
PARTITION BY RANGE(gamedate)
INTERVAL(NUMTOYMINTERVAL(1, 'YEAR'))
  ( PARTITION p0 VALUES LESS THAN (TO_DATE('1-1-1922', 'DD-MM-YYYY')),
    PARTITION p1 VALUES LESS THAN (TO_DATE('1-1-1923', 'DD-MM-YYYY')),
    PARTITION p2 VALUES LESS THAN (TO_DATE('1-1-1924', 'DD-MM-YYYY')),
    PARTITION p3 VALUES LESS THAN (TO_DATE('1-1-1925', 'DD-MM-YYYY'))
  )
;

Now after inserting all the data (1921-2017), I can see in SQL Developer that it automatically created lots of extra partitions. Nice! My experience with other databases is that this would need to be done manually, so I appreciate this feature.

SQL Developer view of the partitions

I once read a quote from a big data company CEO saying that for the sport of basketball, you could not calculate player metrics in close to real time with an old fart relational database - you needed his super-duper, high-scale big data infrastructure to do anything like that. Now I don't know much about basketball, so for all I know he's right, but I think baseball would give us a close enough comparison...

So my next mission is to aggregate this data in the form of materialized views, and get some player metrics such as batting average per year and of all time. I'm going to try to take advantage of the FAST REFRESH feature of materialized views to calculate these metrics in close to real time. Imagine data trickling into the PLAY table, and almost instantly we can update the player's batting average, and even tell you the rank of his batting average not only for the year, but for all time. I'll see if it can be done!

Saturday, February 17, 2018

Oracle Table Diffing with the Mwrynn PL/SQL Table API


After about two years, I just pushed to github an update to the Mwrynn PL/SQL Table API. I've added the basics of a new feature I've been thinking about for a while -- table diffing! Too many times I've wanted to compare two tables automatically, to discover what differs between the two. I plan to add diffing utils for columns, indexes, constraints and finally, the data itself. For now I've got column-based diffs working (edit: and data diffing!). Hooray!

If you'd like to try it out for yourself, go to https://github.com/mwrynn/plsql-table and download away!

Below is a small demo...

CREATE TABLE test1(a VARCHAR2(2) PRIMARY KEY, b NUMBER, c BLOB, e DATE);
CREATE TABLE test2(b NUMBER, a VARCHAR2(2) PRIMARY KEY, c VARCHAR2(20), d TIMESTAMP);

As a human being, let's compare these two tables.
1) Columns a and b exist in both tables, are the same types, but in different order.
2) Column c exists in both tables, but it's a a BLOB in test1, and a VARCHAR(20) in test2. 
3) Column e only exists in test1.
4) Column d only exists in test2.

Now let's run the test (yeah it sets a bad example to use the SYS schema...):

DECLARE
  test1 table_obj;
  test2 table_obj;
  result INT;
BEGIN
  test1 := table_obj('TEST1','SYS',NULL);
  test2 := table_obj('TEST2','SYS',NULL);
  
  result := test1.diff(test2);
END;
/

Now we query a special results table that is automatically created called DIFF_RESULTS, and voila, we can see the diffs!

"ID" "DIFF_RAN_ON" "TABLE1" "TABLE2" "DIFF_TYPE" "RESULT"

1 17-FEB-18 03.56.56.181538000 PM "SYS.TEST1" "SYS.TEST2" "C" "SYS.TEST1.A has column_id=1 but SYS.TEST2.A has column_id=2;SYS.TEST1.B has column_id=2 but SYS.TEST2.B has column_id=1;SYS.TEST1.C has data_type=BLOB but SYS.TEST2.C has data_type=VARCHAR2, SYS.TEST1.C has data_length=4000 but SYS.TEST2.C has data_length=20;column SYS.TEST1.E not found in SYS.TEST2, column SYS.TEST2.D not found in SYS.TEST1"

As you can see, the RESULT column aligns with the differences that I noted as a human. w00t! Additionally, it tells us the difference in data_length for column C -- 4000 for the CLOB and 20 for the VARCHAR2(20).

NEXT STEP -- I'm most excited about comparing the data between two tables, so I think I will work on that next...Stay tuned!

EDIT: I've now added support for data diffing as well! When compare_data is set to true, DIFF_RESULTS will now provide the counts of rows in one table but not the other, and vice versa. An unfortunate downside is that Large Object (LOBs) are not supported by set operators, and the data diffing requires the MINUS set operator. So I have to say, at least for now, that LOBs are not supported...

Example of data diffing:

CREATE TABLE test1(a VARCHAR2(2) PRIMARY KEY, b NUMBER, c VARCHAR2(4000), e DATE);
CREATE TABLE test1a(a VARCHAR2(2) PRIMARY KEY, b NUMBER, c VARCHAR2(4000), e DATE);

/* populate tables */
INSERT INTO test1 VALUES ('ab', 3, 'NRIOSANFSDF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('cd', 4, 'SDAF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('ef', 5, 'MBKDB', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('gh', 6, '001100101', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('ij', 5, '001100101', to_date('2018-02-16', 'yyyy-mm-dd'));

INSERT INTO test1a VALUES ('ab', 3, 'NRIOSANFSDF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1a VALUES ('cd', 4, 'SDAF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1a VALUES ('ef', 5, 'MBKDB', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1a VALUES ('gh', 6, '001100101', to_date('2018-02-17', 'yyyy-mm-dd'));

INSERT INTO test1a VALUES ('ij', 4, '001100101', to_date('2018-02-16', 'yyyy-mm-dd'));

DECLARE
  test1 table_obj;
  test1a table_obj;
  result INT;
BEGIN
  test1 := table_obj('TEST1','SYS',NULL);
  test1a := table_obj('TEST1A','SYS',NULL);
  
  result := test1.diff(test1a, compare_columns => false, compare_data => true);
END;

/

And now the results...

"ID" "DIFF_RAN_ON" "TABLE1" "TABLE2" "DIFF_TYPE" "RESULT"
5 17-FEB-18 09.46.52.608436000 PM "SYS.TEST1" "SYS.TEST1A" "D" "1 rows in SYS.TEST1 not found in SYS.TEST1A;1 rows in SYS.TEST1A not found in SYS.TEST1"

That's all for now! Any questions/comments are welcome!

Thursday, February 15, 2018

Connor McDonald: "I DONT WANT TO KNOW WHAT MY SEQUENCE VALUES ARE!!!"



I like the discussion in the comments most of all. It's ok if your sequence has gaps! If you have a business requirement for gap-free sequences I 1) question the legitimacy of that requirement and 2) hope you enjoy having a huge inhibitor of insert scalability.