Sunday, December 23, 2018

Oracle Oddity

Oracle:

Why do some of the USER_* views such as USER_CONSTRAINTS include the OWNER column, but USER_TAB_COLUMNS does not? The world may never know...

Wednesday, October 10, 2018

Lukas Eder on Why SQL Bind Variables are Important for Performance

Glad to see someone taking up the torch of Tom Kyte's evangelism on using bind variables! My only curiosity is if it's measurably important for performance reasons in other databases such as MySQL. Maybe I ought to test it myself!

https://blog.jooq.org/2018/04/12/why-sql-bind-variables-are-important-for-performance/#comment-245993

Sunday, September 30, 2018

Friday, September 14, 2018

Yet another performance problem solved...OR...Yet another weird MySQL quirk of lameness

My coworker recently came up to me asking me why his index wasn't being used for a particular query (this is on MySQL) - it was just looking up by some IDs and the query was so slow, causing our application's UI to perform poorly! He gave me a sample query. It looked something like this:

SELECT * FROM MYTABLE WHERE EXT_ID IN (11111, 22222, 33333, 44444);

I did a little bit of digging and found the following:
  • EXT_ID was actually a VARCHAR.
  • There was indeed an index on EXT_ID
  • EXPLAIN showed that the INDEX was not planned to be used
  • The table had about 15 million rows
  • The column was high cardinality (unique or very close to it, but no unique constraint as we could not guarantee true uniqueness)
  • The query took over 20 seconds to run
Seems like a straightforward case - no fancy joins, subqueries or any other complicating factors. Just a look up by a high cardinality column. So why wasn't the index being used?

"Hmm," I thought, "Why wouldn't we quote these VARCHARs? Maybe that has to do with it..."

So I added quotes, and voila - the query went from taking over 20 seconds to about .04 seconds. A speed improvement of something like 50,000%!

I sighed and commented that MySQL's planner was buggy, but it turns out this is not a bug! It is documented behavior. From the MySQL 8.0 documentation:
For comparisons of a string column with a number, MySQL cannot use an index on the column to look up the value quickly. If str_col is an indexed string column, the index cannot be used when performing the lookup in the following statement:
SELECT * FROM tbl_name WHERE str_col=1;
The reason for this is that there are many different strings that may convert to the value 1, such as '1'' 1', or '1a'.

So there you have it folks. If you're looking up by a VARCHAR column, even if it contains numbers, please quote your lookup value. BETTER YET: DON'T DO THIS AND BIND YOUR VARIABLES. This means if you're using JDBC (as one example), form your query as "WHERE str_col=?", use a PreparedStatement, and set the parameter appropriately. The developer who originally wrote this code (in Java) was just concatenating the list of integers (that should've been strings/varchars) and forgot the quotes.

Binding your variables is one of the most important things you can do, and saves you from so many headaches in countless ways. It saves you from having to quote your VARCHARs, it saves you from having to escape quotes and other special characters, it prevents SQL injection, and it may be critical to your database's health (I know this is true with Oracle, not sure about others.) Just make sure you're calling the PreparedStatement's setString() method and not setInt(). ;)

#usebindvariables

Friday, August 24, 2018

Lukas Eder's (of JOOQ fame) look into join elimination

I was really impressed with Lukas Eder's look into the various kinds of join elimination and which databases support them. I was surprised to see Oracle not #1 in this case, but not surprised to see MySQL in dead last.

https://blog.jooq.org/2017/09/01/join-elimination-an-essential-optimiser-feature-for-advanced-sql-usage/

Monday, February 19, 2018

Retrosheet/Oracle Project


I'm brushing up on my Oracle partitioning skills by loading up the Retrosheet baseball data into a table I'm calling "play" - it's called that because it has every bit of data down to the play level going back as far as 1921! (Except Retrosheet lacks data for years 1923-1924...Does anybody know if something happened in baseball in those two years?) Anyway it's not a HUGE amount of data - about 14 million rows - but I thought it would be a sizable enough data set to use for experimenting with partitions and perhaps materialized views.

My thought for partitioning was to partition by year using range/interval partitioning. This way I can tell Oracle to automatically create partitions as data is inserted. I created a few partitions manually as a baseline - I think I could've gotten away with just one. See the DDL below:

CREATE TABLE play ( 
  hometeam VARCHAR2(3),
  visteam VARCHAR2(3),
  gamedate DATE,
  gamenum NUMBER(2,0),
  inning NUMBER(2,0),
  batter VARCHAR2(8),
  pitcher VARCHAR2(8),
  result VARCHAR2(30),
  badj VARCHAR2(1)
)
PARTITION BY RANGE(gamedate)
INTERVAL(NUMTOYMINTERVAL(1, 'YEAR'))
  ( PARTITION p0 VALUES LESS THAN (TO_DATE('1-1-1922', 'DD-MM-YYYY')),
    PARTITION p1 VALUES LESS THAN (TO_DATE('1-1-1923', 'DD-MM-YYYY')),
    PARTITION p2 VALUES LESS THAN (TO_DATE('1-1-1924', 'DD-MM-YYYY')),
    PARTITION p3 VALUES LESS THAN (TO_DATE('1-1-1925', 'DD-MM-YYYY'))
  )
;

Now after inserting all the data (1921-2017), I can see in SQL Developer that it automatically created lots of extra partitions. Nice! My experience with other databases is that this would need to be done manually, so I appreciate this feature.

SQL Developer view of the partitions

I once read a quote from a big data company CEO saying that for the sport of basketball, you could not calculate player metrics in close to real time with an old fart relational database - you needed his super-duper, high-scale big data infrastructure to do anything like that. Now I don't know much about basketball, so for all I know he's right, but I think baseball would give us a close enough comparison...

So my next mission is to aggregate this data in the form of materialized views, and get some player metrics such as batting average per year and of all time. I'm going to try to take advantage of the FAST REFRESH feature of materialized views to calculate these metrics in close to real time. Imagine data trickling into the PLAY table, and almost instantly we can update the player's batting average, and even tell you the rank of his batting average not only for the year, but for all time. I'll see if it can be done!

Saturday, February 17, 2018

Oracle Table Diffing with the Mwrynn PL/SQL Table API


After about two years, I just pushed to github an update to the Mwrynn PL/SQL Table API. I've added the basics of a new feature I've been thinking about for a while -- table diffing! Too many times I've wanted to compare two tables automatically, to discover what differs between the two. I plan to add diffing utils for columns, indexes, constraints and finally, the data itself. For now I've got column-based diffs working (edit: and data diffing!). Hooray!

If you'd like to try it out for yourself, go to https://github.com/mwrynn/plsql-table and download away!

Below is a small demo...

CREATE TABLE test1(a VARCHAR2(2) PRIMARY KEY, b NUMBER, c BLOB, e DATE);
CREATE TABLE test2(b NUMBER, a VARCHAR2(2) PRIMARY KEY, c VARCHAR2(20), d TIMESTAMP);

As a human being, let's compare these two tables.
1) Columns a and b exist in both tables, are the same types, but in different order.
2) Column c exists in both tables, but it's a a BLOB in test1, and a VARCHAR(20) in test2. 
3) Column e only exists in test1.
4) Column d only exists in test2.

Now let's run the test (yeah it sets a bad example to use the SYS schema...):

DECLARE
  test1 table_obj;
  test2 table_obj;
  result INT;
BEGIN
  test1 := table_obj('TEST1','SYS',NULL);
  test2 := table_obj('TEST2','SYS',NULL);
  
  result := test1.diff(test2);
END;
/

Now we query a special results table that is automatically created called DIFF_RESULTS, and voila, we can see the diffs!

"ID" "DIFF_RAN_ON" "TABLE1" "TABLE2" "DIFF_TYPE" "RESULT"

1 17-FEB-18 03.56.56.181538000 PM "SYS.TEST1" "SYS.TEST2" "C" "SYS.TEST1.A has column_id=1 but SYS.TEST2.A has column_id=2;SYS.TEST1.B has column_id=2 but SYS.TEST2.B has column_id=1;SYS.TEST1.C has data_type=BLOB but SYS.TEST2.C has data_type=VARCHAR2, SYS.TEST1.C has data_length=4000 but SYS.TEST2.C has data_length=20;column SYS.TEST1.E not found in SYS.TEST2, column SYS.TEST2.D not found in SYS.TEST1"

As you can see, the RESULT column aligns with the differences that I noted as a human. w00t! Additionally, it tells us the difference in data_length for column C -- 4000 for the CLOB and 20 for the VARCHAR2(20).

NEXT STEP -- I'm most excited about comparing the data between two tables, so I think I will work on that next...Stay tuned!

EDIT: I've now added support for data diffing as well! When compare_data is set to true, DIFF_RESULTS will now provide the counts of rows in one table but not the other, and vice versa. An unfortunate downside is that Large Object (LOBs) are not supported by set operators, and the data diffing requires the MINUS set operator. So I have to say, at least for now, that LOBs are not supported...

Example of data diffing:

CREATE TABLE test1(a VARCHAR2(2) PRIMARY KEY, b NUMBER, c VARCHAR2(4000), e DATE);
CREATE TABLE test1a(a VARCHAR2(2) PRIMARY KEY, b NUMBER, c VARCHAR2(4000), e DATE);

/* populate tables */
INSERT INTO test1 VALUES ('ab', 3, 'NRIOSANFSDF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('cd', 4, 'SDAF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('ef', 5, 'MBKDB', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('gh', 6, '001100101', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1 VALUES ('ij', 5, '001100101', to_date('2018-02-16', 'yyyy-mm-dd'));

INSERT INTO test1a VALUES ('ab', 3, 'NRIOSANFSDF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1a VALUES ('cd', 4, 'SDAF', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1a VALUES ('ef', 5, 'MBKDB', to_date('2018-02-17', 'yyyy-mm-dd'));
INSERT INTO test1a VALUES ('gh', 6, '001100101', to_date('2018-02-17', 'yyyy-mm-dd'));

INSERT INTO test1a VALUES ('ij', 4, '001100101', to_date('2018-02-16', 'yyyy-mm-dd'));

DECLARE
  test1 table_obj;
  test1a table_obj;
  result INT;
BEGIN
  test1 := table_obj('TEST1','SYS',NULL);
  test1a := table_obj('TEST1A','SYS',NULL);
  
  result := test1.diff(test1a, compare_columns => false, compare_data => true);
END;

/

And now the results...

"ID" "DIFF_RAN_ON" "TABLE1" "TABLE2" "DIFF_TYPE" "RESULT"
5 17-FEB-18 09.46.52.608436000 PM "SYS.TEST1" "SYS.TEST1A" "D" "1 rows in SYS.TEST1 not found in SYS.TEST1A;1 rows in SYS.TEST1A not found in SYS.TEST1"

That's all for now! Any questions/comments are welcome!

Thursday, February 15, 2018

Connor McDonald: "I DONT WANT TO KNOW WHAT MY SEQUENCE VALUES ARE!!!"



I like the discussion in the comments most of all. It's ok if your sequence has gaps! If you have a business requirement for gap-free sequences I 1) question the legitimacy of that requirement and 2) hope you enjoy having a huge inhibitor of insert scalability.