Tuesday, March 03, 2015

Database Independence comment on Reddit, or, "What's a rant?"

I went into a bit of a rant on reddit the other day, in response to a comment about how using an ORM allows you to easily switch from one database to another. I'm just saving it here for posterity, as it took far too long to write. Here it is:

Original Post:
The only use I see for ORMs is if you might be switching database makers i.e. you know you're going to switch from mysql to postgres. The truth is most people don't use much of the database maker specific functions normally for it to really matter
My reply:
Even then there are many gotchas to be aware of. Just as one example, READ COMMITTED isolation works differently from one database to another, especially between databases that have multiversioning like Postgres and Oracle vs. those that do not. (SQL Server lets you optionally enable it these days, but last I looked it’s off by default.) The implications of just this one point can be huge. I’ve worked with many developers who claim their applications are “database agnostic” and in many cases I can prove that false by breaking it on them. (If all they're doing is inserting into an audit/reporting database, it's usually OK.) I’ve interviewed developers (as the resident database guy), and when I get to quizzing them on their ORM of choice, they claim it’s great, as they’ve interchanged databases seamlessly with no problems. I have trouble believing that they’ve tested in a multiuser setting very well. I actually feel stronger than maybe I should about this issue, so please forgive the rant! 
Let’s say you’re writing your own wiki web application, and when the user logs in, you show him the latest updates to articles since his last log in. This shouldn't be anything too complicated. You might have a “users” table that has username varchar primary key, last_login timestamp. Another is “articles”: id serial primary key, last_update timestamp, title varchar, etc... So, upon user login, the web app runs the query: SELECT title FROM articles WHERE last_update > (SELECT last_login FROM users WHERE username=:1); 
Let’s say a user Bob updates an article at a time that overlaps with another user, Steve, logging in. Let’s run through what happens in postgres: 
EXAMPLE 1: POSTGRES/MULTIVERSIONED DB 
Time = 12:00:01: Bob updates the title to article 123: UPDATE articles SET last_update=12:00:01, title=‘Dummy Title’ WHERE id=123;This transaction is taking a while. It doesn’t matter why, but maybe the server is running some maintenance, so it takes a good 10 seconds for the update to complete. 
Time = 12:00:03: While the update is running in another session, Steve logs in and the application runs the query we mentioned above. Steve’s last login was 10:00:00. This query does not see the 123 update, because the transaction hasn’t been committed yet. It is not blocked by the UPDATE. It sees the previously committed version of the row, where last_update was 09:00:00, perhaps, but does not select it (9:00 is not greater than 10:00, obviously). 
Time = 12:00:11: Bob’s UPDATE finishes and is committed. 
Time = 12:00:13: Steve’s query finishes, and the results are returned and displayed. Not included is the update to 123. There’s just one more step: update Steve’s last_login to the time of login, 12:00:03. Now next time Steve logs in, he will miss Bob's update. Therefore it’s missed forever!Now let’s run through what happens in a database that either lacks multiversioning, or it’s just not turned on. 
EXAMPLE 2: NON-MULTIVERSIONED DB 
Time = 12:00:01: Bob updates article 123: UPDATE articles SET last_update=12:00:01, title=‘Dummy Title’ WHERE id=123; 
As in our last example, this transaction is taking a while. 
Time = 12:00:03: While the update is running in another session, Steve logs in and the application runs the query we mentioned above. This query reads a whole bunch of records until it gets to 123, which is being updated by Bob. There is a blocking lock, therefore this session waits for the lock to be released. 
Time = 12:00:11: Bob’s UPDATE finishes and is committed and the lock is released. 
Time = 12:00:13: Steve’s query gets row 123, finishes, and the results are returned and displayed. It retrieves 123 as the developer intended. Yay. Finally, we update Steve’s last_login to the time of login, 12:00:03. Looks like everything is peachy!Ok what did we learn? Postgres is buggy and the other was great? No, Postgres has multiversioning which is great for scalability, but you have to be aware of it. We learned that this logic needs to be coded entirely differently in one database vs. another. I suppose you could code for the lowest common denominator and find a way to make the updates block reads universally, but then you’ve just decreased performance, scalability, and increased the chance of deadlocks. Imagine you have 1000 users logged in, and if even just 1% of them are updating their articles at any given time…Ouch. 
There might actually be a better way to have your cake and eat it too, but certainly not if you code generically. Maybe when you set last_login (which we did for Steve at time=12:00:13), you query a postgres system catalog (pg_stat_activity?) to get the currently running transactions on the articles table. Then you set last_login to min([real login time], [min transaction time on articles table]). So in this example, the value to set would be the update’s start time, 12:00:01. We also need to change “WHERE last_update >” to “WHERE last_update >=“. Upon next login, Steve's session sees 12:00:01 >= 12:00:01 and voila, it worked. [Maybe rename last_login to last_article_read or something better, at this point.]
Is there a better solution? Maybe. But the point is databases are all different, and even if you think you're not using any fancy, db-specific features, because your SQL syntax is basic and simple, you're still most likely using fancy, db-specific features. And you need to be aware of it before you can switch databases seamlessly. They have fundamental differences, and the example above was only illustrating one of them. (There's definitely more to talk about on locking, for example does database X do row-level locks, page-level locks, lock escalation...?) 

No comments: