Thursday, March 19, 2015

Parallel Sequential Scan for PostgreSQL 9.5

Parallel Sequential Scan for PostgreSQL 9.5

Yes please!

Thursday, March 05, 2015

The mwrynn PL/SQL Table API

The mwrynn Table API is a PL/SQL object type I’ve written for Oracle. It wraps up some handy functions for generating SQL queries, and performing common operations such as “disable all indexes” that Oracle doesn’t provide conveniently out of the box. (I made it an object type instead of a package so it can be extended for a particular table, to the user's liking.) 

My initial inspiration was the the many times I’ve wanted the following construct to be available in SQL: Imagine you had a 50 column table, and you wanted to select col1 through 48, but not 49 and 50. I want to do this: SELECT *-(col49, col50) FROM mytable.

Typing it all out is Tedium City: SELECT col1, col2, col3, col4, col5, col6, col7, col8, col9, col10, col11, col12 [snip — we’re just 25% there]

At the same time, “SELECT *” can be a bad practice to use in your code. Generally fine for running ad-hoc queries, but I wouldn’t want to do that in a PL/SQL package that I plan on using in prod. So, my Table API would provide the entire select list (col1, col2, etc. etc.) for you, which you could just copy and paste. 

It kind of snowballed from there. I’ve got a number of functions, for example I built one that matches column names between two tables. This could be useful for merge/update statements — your external table linked to a flat file probably has column names matching the destination you want to copy to. So, this function would generate a.col1=b.col1, a.col2=b.col2, etc.

Another neat function generates random data. It automatically reads each column’s data type, and generates the appropriate type of data. I’ve found this useful for testing purposes, particularly although it has its limits of course (completely random values tend to have a high cardinality for example — maybe I’ll add optional cardinality constraints in the future!) I’ve found in the case of dates and timestamps it’s useful to have a date range limit, so I've implemented that. For numbers and strings well, there are no constraints at the moment. Foreign keys are generated by selecting a random PK value from the parent table.

Enough talk. Let’s see some action. Let’s run this PL/SQL block. Comments explain what I’m doing at each step. But first, we need a little DDL

 
 CREATE TABLE state (id VARCHAR2(2) PRIMARY KEY, name VARCHAR2(30));  
 INSERT INTO state SELECT 'NY', 'New York' FROM dual;  
 INSERT INTO state SELECT 'CA', 'California' FROM dual;  
 INSERT INTO state SELECT 'FL', 'Florida' FROM dual;  
 INSERT INTO state SELECT 'CO', 'Colorado' FROM dual;  

 CREATE TABLE person (id INT PRIMARY KEY, first_name VARCHAR2(10), last_name VARCHAR2(10), address VARCHAR2(20), city VARCHAR2(10), state VARCHAR2(2) REFERENCES state(id), zip VARCHAR2(5))   
 CREATE INDEX person_city_idx ON person(city);  
 CREATE INDEX person_zip_idx ON person(zip);   

 CREATE TABLE person_ext (id INT PRIMARY KEY, first_name VARCHAR2(10), last_name VARCHAR2(10), address VARCHAR2(20), city VARCHAR2(10), state VARCHAR2(2), zip VARCHAR2(5))   
 DECLARE  
  person_table_obj table_obj;  
  person_ext_table_obj table_obj;  
  cnt INTEGER;  
BEGIN
  --create table objects for mwrynn.person mwrynn.person_ext; third param null means local table, i.e. no dblink is used in qualifying table name
  person_table_obj := table_obj('PERSON', 'MWRYNN', null);
  person_ext_table_obj := table_obj('PERSON_EXT', 'MWRYNN', null);

  --display all the columns
  dbms_output.put_line(person_table_obj.all_cols);

  --disable all indexes and display number of successful disables
  dbms_output.put_line('Disabled ' || person_table_obj.disable_indexes || ' indexes.');

  --re-enable the indexes, and display number of successful renables
  dbms_output.put_line('Enabled ' || person_table_obj.enable_indexes || ' indexes.');
  
  --get list of columns to join on
  dbms_output.put_line(person_table_obj.join_list(person_ext_table_obj, 'a', 'b'));
  
  --get the list of matching columns, i.e. a.col1=b.col1, a.col2=b.col2, etc.
  dbms_output.put_line(person_table_obj.matched_update_list(person_ext_table_obj, 'a', 'b'));
  
  --generate random data, just 5 rows for our purposes. The foreign key column, state, will be looked up in the parent lookup table 
  cnt := person_table_obj.insert_random_rows(5, null, null); 
END;
 /  
 SELECT * FROM person;  

The output:

 anonymous block completed  
 ID,FIRST_NAME,LAST_NAME,ADDRESS,CITY,STATE,ZIP  
 Disabled 2 indexes.  
 Enabled 2 indexes.  
 a.ID=b.ID  
 a.FIRST_NAME=b.FIRST_NAME,a.LAST_NAME=b.LAST_NAME,a.ADDRESS=b.ADDRESS,a.CITY=b.CITY,a.STATE=b.STATE,a.ZIP=b.ZIP  
     ID FIRST_NAME LAST_NAME ADDRESS       CITY    STATE ZIP   
 ---------- ---------- ---------- -------------------- ---------- ----- -----  
 5041759599 DrUeedsMza cmUybFRdok ZyLBSrpNarfAsnNbHxzK xTkXtWdngA CA  bWjJs   
  885481864 VXNsOpNbwb POwZITUopy sqFQHHWcxFfcVNCLJyzq ysCdZKLKEQ FL  UcfkU   
 2357440692 vtDRoSaFXI JThrcRBQbL YrEyCAEAxXYErDInLciw SVdmHxtCNw CO  KxuIS   
 7331548629 pyKOLSGsTL NTlFxgpxAQ wsEZwiBlLaZAbsLenkeb IHWzBljRVL FL  LlhLA   
 4568868701 sftvDZyZpc HxEWPyenqR uPcMxkmJEMCgPYXMgUYq OrOZidpfKB CA  kpPgE   

First we've disabled and re-enabled the two indexes. (The primary key's index is never disabled.) Then we can see the join condition. Next is the output of matched_update_list -- the list of matching columns for the purpose of UPDATEs and MERGEs.

Finally we see the randomly generated data. As you can see, the values for the state column were pulled from the parent state table. The rest of the data is hideous but that's all it is -- random nonsense. If I had a lookup table of city/state/zip combinations, it could draw from that one. But I don't. :)

That's the mwrynn Table API for now. With such few columns this might not look so amazing, but it can really save time. I once had to work with about a dozen tables, each with about 20-40 columns, and write out my insert queries and such, plus generate test data. There are GUI tools that may provide some of the generation functionality, but I always found them cumbersome to use. All in all, this object type saved me quite a number of hours!

Note: I've added this API to Github! https://github.com/mwrynn/plsql-table

Wednesday, March 04, 2015

The Oracle section of my bookshelf...

Think I have enough Oracle books? I've read some of these cover to cover, even. The Tom Kyte ones are still great reading. I heavily used "Expert One on One" and "Beginning Oracle Programming" around 2003-2006 when I first started working with Oracle. I pretty much always had at least one of them open on my desk. :) They're still useful references, just have to be careful of out of data material at this point. I used to additionally have Jonathan Lewis' great performance tuning book, but I left it at another company... (Don't mind the little SQL Server book at the end. He just wants to be part of the group. And apparently you can learn all of SQL Server in 10 minutes. Wow.)

Tuesday, March 03, 2015

Database Independence comment on Reddit, or, "What's a rant?"

I went into a bit of a rant on reddit the other day, in response to a comment about how using an ORM allows you to easily switch from one database to another. I'm just saving it here for posterity, as it took far too long to write. Here it is:

Original Post:
The only use I see for ORMs is if you might be switching database makers i.e. you know you're going to switch from mysql to postgres. The truth is most people don't use much of the database maker specific functions normally for it to really matter
My reply:
Even then there are many gotchas to be aware of. Just as one example, READ COMMITTED isolation works differently from one database to another, especially between databases that have multiversioning like Postgres and Oracle vs. those that do not. (SQL Server lets you optionally enable it these days, but last I looked it’s off by default.) The implications of just this one point can be huge. I’ve worked with many developers who claim their applications are “database agnostic” and in many cases I can prove that false by breaking it on them. (If all they're doing is inserting into an audit/reporting database, it's usually OK.) I’ve interviewed developers (as the resident database guy), and when I get to quizzing them on their ORM of choice, they claim it’s great, as they’ve interchanged databases seamlessly with no problems. I have trouble believing that they’ve tested in a multiuser setting very well. I actually feel stronger than maybe I should about this issue, so please forgive the rant! 
Let’s say you’re writing your own wiki web application, and when the user logs in, you show him the latest updates to articles since his last log in. This shouldn't be anything too complicated. You might have a “users” table that has username varchar primary key, last_login timestamp. Another is “articles”: id serial primary key, last_update timestamp, title varchar, etc... So, upon user login, the web app runs the query: SELECT title FROM articles WHERE last_update > (SELECT last_login FROM users WHERE username=:1); 
Let’s say a user Bob updates an article at a time that overlaps with another user, Steve, logging in. Let’s run through what happens in postgres: 
EXAMPLE 1: POSTGRES/MULTIVERSIONED DB 
Time = 12:00:01: Bob updates the title to article 123: UPDATE articles SET last_update=12:00:01, title=‘Dummy Title’ WHERE id=123;This transaction is taking a while. It doesn’t matter why, but maybe the server is running some maintenance, so it takes a good 10 seconds for the update to complete. 
Time = 12:00:03: While the update is running in another session, Steve logs in and the application runs the query we mentioned above. Steve’s last login was 10:00:00. This query does not see the 123 update, because the transaction hasn’t been committed yet. It is not blocked by the UPDATE. It sees the previously committed version of the row, where last_update was 09:00:00, perhaps, but does not select it (9:00 is not greater than 10:00, obviously). 
Time = 12:00:11: Bob’s UPDATE finishes and is committed. 
Time = 12:00:13: Steve’s query finishes, and the results are returned and displayed. Not included is the update to 123. There’s just one more step: update Steve’s last_login to the time of login, 12:00:03. Now next time Steve logs in, he will miss Bob's update. Therefore it’s missed forever!Now let’s run through what happens in a database that either lacks multiversioning, or it’s just not turned on. 
EXAMPLE 2: NON-MULTIVERSIONED DB 
Time = 12:00:01: Bob updates article 123: UPDATE articles SET last_update=12:00:01, title=‘Dummy Title’ WHERE id=123; 
As in our last example, this transaction is taking a while. 
Time = 12:00:03: While the update is running in another session, Steve logs in and the application runs the query we mentioned above. This query reads a whole bunch of records until it gets to 123, which is being updated by Bob. There is a blocking lock, therefore this session waits for the lock to be released. 
Time = 12:00:11: Bob’s UPDATE finishes and is committed and the lock is released. 
Time = 12:00:13: Steve’s query gets row 123, finishes, and the results are returned and displayed. It retrieves 123 as the developer intended. Yay. Finally, we update Steve’s last_login to the time of login, 12:00:03. Looks like everything is peachy!Ok what did we learn? Postgres is buggy and the other was great? No, Postgres has multiversioning which is great for scalability, but you have to be aware of it. We learned that this logic needs to be coded entirely differently in one database vs. another. I suppose you could code for the lowest common denominator and find a way to make the updates block reads universally, but then you’ve just decreased performance, scalability, and increased the chance of deadlocks. Imagine you have 1000 users logged in, and if even just 1% of them are updating their articles at any given time…Ouch. 
There might actually be a better way to have your cake and eat it too, but certainly not if you code generically. Maybe when you set last_login (which we did for Steve at time=12:00:13), you query a postgres system catalog (pg_stat_activity?) to get the currently running transactions on the articles table. Then you set last_login to min([real login time], [min transaction time on articles table]). So in this example, the value to set would be the update’s start time, 12:00:01. We also need to change “WHERE last_update >” to “WHERE last_update >=“. Upon next login, Steve's session sees 12:00:01 >= 12:00:01 and voila, it worked. [Maybe rename last_login to last_article_read or something better, at this point.]
Is there a better solution? Maybe. But the point is databases are all different, and even if you think you're not using any fancy, db-specific features, because your SQL syntax is basic and simple, you're still most likely using fancy, db-specific features. And you need to be aware of it before you can switch databases seamlessly. They have fundamental differences, and the example above was only illustrating one of them. (There's definitely more to talk about on locking, for example does database X do row-level locks, page-level locks, lock escalation...?)