Saturday, September 20, 2014

Fark thread about Oracle

The thread was actually about the news of Larry Ellison stepping down, but the comments included some technical, uh, opinions. I could not comment myself, because apparently Fark closes threads to comments very quickly.

But get a load of this:

Poster 1:
I wouldn't have deleted the record.  I would have done something more like:

UPDATE tblOracleEmployees set ActiveEmp=0 WHERE strName = 'Larry Ellison' AND strJobTitle = 'CEO'

 
Seriously...  You don't delete records.  That destroys all relational joins.  And you will need to have those records on hand.

//Yes.  I'm a data analyst.

Poster 2 (in reply to Poster 1):

Actually joining data tables at runtime ... what a quaint little database.  I can't imagine adopting Oracle to store something considered Big Data these days.  To make it work you're not allowed to ever run a join or use 99% of the features that make up SQL.  Might as well just bite the bullet and use HDFS ... save some money anyway.  And if you don't require that scale ... well then why not use something cheaper than Oracle?  Their days have got to be numbered.

What I was going to post (but again, couldn't, because Fark closes comments quickly these days):
Nothing wrong with joining - relational databases were born to join. If the extra I/O really impedes your performance that much (I find this is usually overstated), Oracle has a lot of options to look at. I would probably physically colocate commonly joined data using table clustering. Another option might be materialized views, if the data is read much more often than it is updated (although as of 11g, Oracle’s query result cache behaves pretty much like “just-in-time materialized views”).  
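To give a flavor of the clustering idea: an Oracle table cluster stores rows from commonly joined tables in the same data blocks, keyed on the join column, so the join costs far less I/O. A rough sketch - the cluster and table names here are made up, not from any real schema:

```sql
-- cluster keyed on the join column; rows of dept and emp that share a
-- deptno value end up physically colocated in the same blocks
CREATE CLUSTER emp_dept_cluster (deptno NUMBER(4)) SIZE 512;
CREATE INDEX emp_dept_cluster_idx ON CLUSTER emp_dept_cluster;

CREATE TABLE dept (deptno NUMBER(4) PRIMARY KEY, dname VARCHAR2(30))
  CLUSTER emp_dept_cluster (deptno);

CREATE TABLE emp (empno NUMBER(6) PRIMARY KEY, ename VARCHAR2(30), deptno NUMBER(4))
  CLUSTER emp_dept_cluster (deptno);
```

After this, a join of emp to dept on deptno reads mostly shared blocks instead of hopping around two separately stored tables.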
What’s really bothering me is the use of names like tblOracleEmployees - that’s a Java-style naming convention, and this is not Java. Oracle folds unquoted identifiers to upper case anyway, unless you surround the identifier in quotes, so that name becomes TBLORACLEEMPLOYEES and the mixed case in your source is just misleading. I don’t like prefixes like “tbl” either, especially since tables are kind of the default object, but that’s just my personal opinion. If you do like prefixes, the typical convention is to separate words with underscores. Whatever you decide, please please don’t use quotes to make it "tblOracleEmployees" - that will require every query from here till eternity to match your caps and surround the identifier in quotes.
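To spell out the case-folding behavior (a sketch - these statements are just for illustration):

```sql
CREATE TABLE tblOracleEmployees (id NUMBER);     -- stored as TBLORACLEEMPLOYEES
SELECT * FROM TBLORACLEEMPLOYEES;                -- works: unquoted names fold to upper case
SELECT * FROM tbloracleemployees;                -- also works, for the same reason

CREATE TABLE "tblOracleEmployees2" (id NUMBER);  -- quoted: case is preserved
SELECT * FROM tblOracleEmployees2;               -- fails: ORA-00942 table or view does not exist
SELECT * FROM "tblOracleEmployees2";             -- only this exact quoted form works
```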
Can you imagine? Joining data is soooo lame. I've actually encountered this opinion at jobs - developers using bad designs because they heard joins were slow. I fix the designs, the joins work, everyone and everything is happy - ta da.

Sunday, August 17, 2014

Problem with JDBC driver for Oracle 12c

Oracle has changed some of the behavior of its JDBC driver for 12c. If autocommit is enabled and an explicit commit is executed, it blows up:

java.sql.SQLException: Could not commit with auto-commit set on

OK, fair enough - it doesn't make sense to commit explicitly with auto-commit on. Comments I've read state that the JDBC spec demands this behavior, and Oracle is finally catching up. The earlier Oracle JDBC drivers don't throw this error, and I don't believe any other vendors' JDBC drivers (Microsoft's, etc.) throw it either!

Our application needs to be fixed now.
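The fix amounts to guarding explicit commits behind an autocommit check. Here's a minimal sketch of that guard - the class and method names are mine, and the reflective stand-in Connection exists only so the example runs without a real database:

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;

public class CommitGuard {
    // Only issue an explicit commit when autocommit is off -- the 12c
    // driver throws SQLException if you commit with autocommit enabled.
    // Returns true if an explicit commit was actually issued.
    static boolean commitIfNeeded(Connection conn) throws SQLException {
        if (conn.getAutoCommit()) {
            return false; // the driver already commits each statement for us
        }
        conn.commit();
        return true;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in Connection that reports autocommit as enabled
        // (a reflective proxy, so no database is needed to try this out).
        Connection autoOn = (Connection) Proxy.newProxyInstance(
                Connection.class.getClassLoader(),
                new Class<?>[]{Connection.class},
                (proxy, method, margs) -> "getAutoCommit".equals(method.getName()));
        System.out.println(commitIfNeeded(autoOn)); // prints false: no explicit commit attempted
    }
}
```

The same check works fine against the older drivers too, so it's safe to apply everywhere rather than special-casing 12c.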

Silly little problem installing Oracle on EC2

My boss was attempting to install Oracle 12c on an EC2 instance running Linux but it was failing mysteriously, so he handed it off to me, the Database Guy (tm). The logs reported this error:

"ORA-21561: OID generation failed"

Wha'? After some head scratching and googling I learned this can happen if the host name you’re installing on is not resolvable. Boss said he did not enable DNS on the VPC (Virtual Private Cloud - an Amazon thing). I took a look at the Oracle install script (that was apparently generated by Amazon) and found that it’s getting the host name via a curl call, to retrieve meta-data Amazon provides for your EC2 instance: something like ORACLE_HOSTNAME=`curl -s http://xxx.xxx.xxx.xxx/latest/meta-data/local-hostname/`


One option would be to get rid of that call and set ORACLE_HOSTNAME to localhost. Another would be to add the EC2 host to /etc/hosts. I opted for the latter (I wasn’t convinced the install script was the only place that needed the host name to resolve), and then the installation worked. (Yay!)
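For the record, the /etc/hosts fix amounted to something like the following. This is a sketch: 169.254.169.254 is the standard EC2 instance metadata address, but the exact metadata paths and your shell setup may vary.

```shell
# grab the instance's private IP and hostname from the EC2 metadata service,
# then map them in /etc/hosts so the name resolves even without VPC DNS
IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)
HOST=$(curl -s http://169.254.169.254/latest/meta-data/local-hostname)
echo "$IP $HOST" | sudo tee -a /etc/hosts
```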

Monday, March 31, 2014

"In Which We Are Not Having Fun"

Speaking of Postgres materialized views... The folks at sparkfun (sellers of Arduino boards as well as other fine products) migrated from MySQL to Postgres for various reasons, and decided to use the newish materialized views. Apparently it was all fun and games until their number of orders increased by an order of magnitude, bringing MV refreshes to their knees. (These MVs keep track of their available stock, it seems.) 

MVs in Postgres are very basic -- no incremental refreshes like Oracle has, so they're crunching all the numbers from square one every single time you say "refresh."

I don't know anything about the sparkfun guys' particular requirements or infrastructure, but I probably would not try to keep Postgres MVs of any complexity constantly up to date in a transactional system. I'm imagining they're trying to update the MVs any time the dependent tables are modified, so too many orders kill them. What if they accepted some data staleness and set the MVs to refresh on a timer (every x minutes)? If the view of stock is out of date by a bit, that might be OK as long as there's a check in place at checkout time. Just kind of wondering out loud here.
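For the timer idea, something as blunt as a cron entry could do the job - a sketch, where the database name and view name are hypothetical:

```
# hypothetical crontab entry: rebuild the stock MV every 5 minutes
*/5 * * * * psql -d shop -c 'REFRESH MATERIALIZED VIEW stock_mv;'
```

Checkout would then validate against the live tables for the one item being purchased, while browsing reads the slightly stale MV.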

This situation shows, by the way, that testing at scale is necessary.

On a related note, their reasons for switching from MySQL to Postgres are pretty numerous.

MariaDB 10 Release

Saw this on Slashdot this morning, after driving through a March 31 snowstorm. Replication slaves don't crash now...What a feature! Guess that was a problem. I should check out Maria one of these days.

Tuesday, March 18, 2014

PostgreSQL Materialized View Example

As I mentioned in an earlier post, PostgreSQL finally supports some very basic materialized views as of version 9.3, and this has me more excited than it probably should! I finally got a chance to give it a try.

Overview of Materialized Views

But first, what is a materialized view? Very briefly, it’s a view whose query results are stored. If a view is a stored query that spits results at you on demand, a materialized view is a view that stores the query results and spits those same results out at you on demand. The ordinary view works hard to get your results every time you query it, while the materialized view has already done the work for you in the past. View = slower, but zero extra disk space; materialized view = faster, but potentially lots of disk space.

If the query results change over time, then at some point the materialized view results will be “stale”. You must tell Postgres to refresh your materialized view from time to time. In more advanced databases, materialized views may refresh automatically, either when a dependent table is updated, on a regular schedule, or on demand. Postgres only supplies “on demand”.

A Simple Example: Baseball Stats

Let’s get right into an example. I like baseball, so the goal of this example will be to come up with team batting averages based on the players’ individual plate appearances. My data is very simple, and a plate appearance can either be a hit or an out.  

So, we have these tables:
 /* each row represents a player on a team */  
 CREATE TABLE player (id SERIAL PRIMARY KEY, name VARCHAR, num SMALLINT, team VARCHAR(3));  
 /* each row represents a plate appearance where a player attempts to hit the ball. In our simplistic model, there are two possible outcomes: he either gets a hit or he is out */  
 CREATE TABLE pa (pa_id SERIAL PRIMARY KEY, player_id INT REFERENCES player(id), gamedate DATE, inning SMALLINT, result VARCHAR(1));  

I have populated them with dummy data. The teams I used are the MLB American League East teams: New York Yankees (NYY), Boston Red Sox (BOS), Tampa Bay Rays (TB), Toronto Blue Jays (TOR) and Baltimore Orioles (BAL). 25 players for each team, getting 4 plate appearances per day over 180 days. These are just numbers I made up. I’m not attempting total realism here.

Now I will write a view to get the batting averages per team:
 CREATE OR REPLACE VIEW team_avg_v AS SELECT team, SUM(CASE WHEN result='H' THEN 1 ELSE 0 END)::float/COUNT(*)::float AS avg FROM pa JOIN player p ON pa.player_id=p.id GROUP BY team;  

The result:
 mwrynn=# select * from team_avg_v;  
  team |    avg      
 ------+-------------------  
  BAL | 0.252296296296296  
  NYY | 0.244592592592593  
  TB  | 0.246666666666667  
  BOS | 0.250444444444444  
  TOR | 0.248444444444444  
 (5 rows)  
 Time: 248.679 ms  

249 ms is a pretty good time, but imagine if we had data for *all* players on *all* teams, even for *all* the years baseball has been played! This simple view could take hours to run, at least on my wimpy little laptop.

Enter the materialized view. All we have to do is the following to materialize this view: 

 mwrynn=# CREATE MATERIALIZED VIEW team_avg_mv AS SELECT * FROM team_avg_v;  
 SELECT 5  
 Time: 190.997 ms  

And now let’s try querying it:
 mwrynn=# select * from team_avg_mv;  
  team |    avg      
 ------+-------------------  
  BAL | 0.252296296296296  
  NYY | 0.244592592592593  
  TB  | 0.246666666666667  
  BOS | 0.250444444444444  
  TOR | 0.248444444444444  
 (5 rows)  
 Time: 0.755 ms  

We went from 249 ms to less than a ms. Not bad, huh? :)

Data Staleness and Refreshing

Probably the biggest downside to using a materialized view in this manner is it can get stale. In the context of our example, this means we add some more plate appearances for our players, and the averages change a bit. Let’s query the plain old view after adding more data:

 mwrynn=# select * from team_avg_v ;  
  team |    avg      
 ------+-------------------  
  BAL | 0.248888888888889  
  NYY | 0.24862962962963  
  TB  | 0.250592592592593  
  BOS | 0.25037037037037  
  TOR | 0.250074074074074  
 (5 rows)  
 Time: 351.739 ms  

Now let’s show that our materialized view is out of date:
 mwrynn=# select * from team_avg_mv ;  
  team |    avg      
 ------+-------------------  
  BAL | 0.252296296296296  
  NYY | 0.244592592592593  
  TB  | 0.246666666666667  
  BOS | 0.250444444444444  
  TOR | 0.248444444444444  
 (5 rows)  
 Time: 0.710 ms  

We need to refresh it to get it up to speed:
 mwrynn=# refresh materialized view team_avg_mv;  
 REFRESH MATERIALIZED VIEW  
 Time: 360.579 ms  

Our Exciting Conclusion

Note that the refresh took about the same amount of time as the plain old view took to query. That’s no coincidence - it’s doing the same work to get our query results (plus storing the data, in the materialized view’s case; and if we had built any indexes on it, those would have to be updated accordingly). This could be a problem if the underlying data changes often and it’s not acceptable to see slightly out-of-date data in our query results. So, like anything else, materialized views are a tool to use in the appropriate situation, not a one-size-fits-all magic bullet. Other databases such as Oracle have special incremental refresh modes that can speed up your refreshes - sort of letting you have your cake and eat it too - but these advanced materialized views tend to get very complex. Maybe at some point I'll show you some of this fancy Oracle materialized view stuff. :)
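As a quick teaser of the Oracle flavor: you create materialized view logs on the base tables so Oracle can track changes, then declare the MV fast-refreshable. This is only a sketch using the tables from this post - fast-refresh aggregate MVs come with a long list of extra requirements (COUNT(*) in the select list among them), and I haven't run this exact DDL:

```sql
-- change logs let Oracle refresh incrementally instead of recomputing from scratch
CREATE MATERIALIZED VIEW LOG ON player WITH ROWID (id, team) INCLUDING NEW VALUES;
CREATE MATERIALIZED VIEW LOG ON pa WITH ROWID (player_id, result) INCLUDING NEW VALUES;

CREATE MATERIALIZED VIEW team_avg_fast_mv
  REFRESH FAST ON COMMIT
  AS SELECT team,
            COUNT(*) AS pa_count,
            SUM(CASE WHEN result = 'H' THEN 1 ELSE 0 END) AS hits
       FROM pa JOIN player p ON pa.player_id = p.id
      GROUP BY team;
```

With REFRESH FAST ON COMMIT there's no manual refresh step at all; the tradeoff is extra work on every commit and the aforementioned complexity.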

Friday, February 28, 2014

Do you understand indexes?

Test Your SQL Performance Know-How in 3 Minutes - the author says 60% of those who take this little quiz fail! You can select a database of your choice and the quiz will grill you on knowing the basics of indexes and when a query can make good use of particular indexes.

I chose Postgres as my database of choice and passed with a 4 out of 5. I attribute my mistake to a misreading but whatever, I have to accept that I messed one up. :)

Friday, February 21, 2014

A simple but slightly weird Sybase issue

A QA guy came to pick my brain about a bug that was reported to him. A Sybase table had a VARCHAR(255) column. The bug was that when a too-large String was inserted via JDBC, the following exception was thrown:

com.sybase.jdbc3.jdbc.SybSQLException: Implicit conversion from datatype 'TEXT' to 'VARCHAR' is not allowed.  Use the CONVERT function to run this query.

Mr. QA could not quite reproduce the problem. Whenever he tried to insert a large String, the data was truncated down to 255 bytes but otherwise worked fine. The issue turned out to be that when you attempt to insert a String larger than the maximum size allowed for a VARCHAR (32767 bytes or thereabouts), the Sybase JDBC driver assumes it must be a TEXT. Sort of makes sense, since that's too big for a VARCHAR, right? So, without any explicit conversion, the TEXT String cannot "fit" into a VARCHAR column. Exception thrown!

From the user's perspective it's kind of funny - a too-big String truncates, a too-too-big String explodes!