Thursday, December 05, 2013

Unsolved Mystery: The datafile that wouldn't shrink

I had a problem the other day in reclaiming disk space on one of my Oracle development databases (10gR2).

Some background:

Made a tablespace delme with datafile delme.dbf.

Made a table mytable (which is very big) in tablespace delme, to use for just a little while and then drop.

Done with table so: drop table mytable;

The problem:

Realizing I should have used "purge" in the DROP TABLE statement, I sigh and say OK, let's just run the Tom Kyte maxshrink script... There's absolutely nothing I want in this tablespace.

Crap, it's still using the space:

SQL> alter database datafile '/opt/oracle/oradata01/delme.dbf' resize 1m
  2  ;
alter database datafile '/opt/oracle/oradata01/delme.dbf' resize 1m
ERROR at line 1:
ORA-03297: file contains used data beyond requested RESIZE value

SQL> alter database datafile '/opt/oracle/oradata01/delme.dbf' resize 5m;
alter database datafile '/opt/oracle/oradata01/delme.dbf' resize 5m
ERROR at line 1:
ORA-03297: file contains used data beyond requested RESIZE value

SQL> alter database datafile '/opt/oracle/oradata01/delme.dbf' resize 100m;
alter database datafile '/opt/oracle/oradata01/delme.dbf' resize 100m
ERROR at line 1:
ORA-03297: file contains used data beyond requested RESIZE value

Hmmm... maybe the issue is that the table wound up in the recycle bin. If I don't use "purge", the table gets put in the recycle bin. So let's find it there:
SQL> select object_name, original_name, type from recyclebin;

no rows selected

What? Let's check whether any objects might actually exist in the datafile:

SQL> select *
  2  from (select owner, segment_name, segment_type, block_id
  3        from dba_extents
  4        where file_id = (select file_id
  5                         from dba_data_files
  6                         where file_name = '/opt/oracle/oradata01/delme.dbf')
  7        order by block_id desc);

no rows selected


So now I'll just try dropping it, I guess. Here's how much space it took up before:

oracle@frd-db01:~/product/10.2/db_1> ll /opt/oracle/oradata01/delme.dbf
-rw-r----- 1 oracle oinstall 38722871296 2013-12-03 05:31 /opt/oracle/oradata01/delme.dbf

and now we drop tablespace:
SQL> drop tablespace delme including contents and datafiles;

Tablespace dropped.

oracle@frd-db01:~/product/10.2/db_1> df -h | grep /opt/oracle/oradata01
                      100G   56G   45G  56% /opt/oracle/oradata01

And yay, I have space again, but I just wish I had gotten to the bottom of the mystery. Unfortunately, I simply don't have the time for that right now. Does anyone in my vast readership know what might have happened? Comments are welcome!

Saturday, November 16, 2013

The Agony and the Ecstasy: Loading a 50G CSV file into an Oracle table Part I

Task: Load a 50+G CSV file into an Oracle database (running on Linux).

First problem: Only about 100G of free disk space was available on the Oracle server. Merely having the 50G CSV reside on the disk was taking up half of it! Is there a way to import a compressed CSV? (CSVs are very compressible.)

Enter named pipes.

Very briefly, named pipes allow for simple interprocess communication. I run one process and pipe output to the named pipe, which appears as a file in the filesystem. For my purpose, this allows me to unzip the huge CSV on the fly.

So how to set this up?
1) Upload mytable.csv.gz to the server running Oracle
2) Make a named pipe called mytable.csv
3) gunzip mytable.csv.gz and output to mytable.csv
4) From Oracle, load mytable.csv as if it were an ordinary CSV file. (I used an external table)

Now for some code so the above can be automated:

#!/bin/bash
# creates named pipe if non-existent and gunzips $1.gz to $1

pipe=$1

trap "rm -f $pipe" EXIT

if [[ ! -p $pipe ]]; then
    mkfifo $pipe
fi

gunzip -c $pipe.gz > $pipe
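For anyone who wants to play with the pipe mechanics without a 50G file or a database handy, here's a minimal Python sketch of the same idea. The file names and sample data are hypothetical; the real load used gunzip and an Oracle external table:

```python
import gzip
import os
import shutil
import tempfile
import threading

def serve_gunzip(gz_path, pipe_path):
    # decompress gz_path and stream the plaintext into the named pipe
    with gzip.open(gz_path, "rb") as src, open(pipe_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

workdir = tempfile.mkdtemp()
gz_path = os.path.join(workdir, "mytable.csv.gz")
pipe_path = os.path.join(workdir, "mytable.csv")

# stand-in for the uploaded mytable.csv.gz
with gzip.open(gz_path, "wb") as f:
    f.write(b"id,name\n1,alice\n2,bob\n")

os.mkfifo(pipe_path)  # the pipe shows up as a file in the filesystem

# opening a fifo blocks until both ends are open, so the writer runs concurrently
writer = threading.Thread(target=serve_gunzip, args=(gz_path, pipe_path))
writer.start()

with open(pipe_path, "rb") as reader:  # the external table would read from here
    data = reader.read()
writer.join()
print(data.decode())
```

The key point is that the reader never sees a compressed byte and the full-size CSV never touches the disk.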

So, I run this script then query my external table that's hooked up to mytable.csv:

INSERT INTO mytable SELECT * FROM ext_mytable; --slightly simplified for illustrative purposes

...and voila! That’s it.

I hope someone else can enjoy this tip because I found it tremendously useful. By the way, the compressed CSV at 4.6G was less than 10% of the original CSV’s size! Nice…

I may write more of the hurdles and solutions involved in loading this big table into Oracle. But the named pipe solution was the neatest, coolest part, so I may leave it at that. :)

Postgres now supported on Amazon RDS

Finally!! Long story short time...

At work I had co-developed an application to run on Postgres - being the local database guy, I decided on Postgres as it is my preferred open source database. Fast forward about a year, and my boss said to me, "Mwrynn, it would really be nice to run your app on that RDS thing to make life easy for us when we go live. You know, RDS provides all those nice scalability, availability and management features." I agreed and converted the application to MySQL. This is not an automatic process just because we use a fancy ORM - it took a good deal of work. Then we went into production.

Now Postgres is available on RDS and I may have to convert back to Postgres again. The music running through my head:

Tuesday, September 10, 2013

New features in Postgres 9.3 (just released)

New features in PostgreSQL 9.3

I'm particularly excited about materialized views - a very useful feature I've used in Oracle for years. It looks like Postgres MVs will only have a fraction of the Oracle functionality, but hey it's a start!

Updatable views: also very nice. This feature can be useful when you have some tables that you want to make "private" and only expose a view based on them. An example off the top of my head... Maybe you have an archive_purchase table and current_purchase table - kind of like partitions - probably current_purchase is lean and mean and archive_purchase is old, huge and fat. You have some page in your web app that lets the user view and edit purchases from any time, so you make the view all_purchase that UNIONs the two. Now you simply query and update all_purchase without having to worry about where the underlying row comes from.

...And many more nice-looking new features.

100x faster Postgres performance by changing 1 line

From Datadog: 100x faster Postgres performance by changing 1 line

TL;DR: To check membership against a large list of values, use ANY(VALUES (...)) instead of ANY(ARRAY[...])

If this is a consistent issue, hopefully it's something Postgres can better optimize in the query planner.

Monday, August 12, 2013

Some good SQL tips

Two very nice articles on SQL tips for Java developers, or anyone who uses SQL, really: 

10 Common Mistakes Java Developers Make When Writing SQL

10 More Common Mistakes Java Developers Make When Writing SQL 

In Part 2, one tip is always to use ANSI joins. I tend to agree, but I just want to note that I have run into situations where the "old style" was required by Oracle. Maybe someone can correct me if this doesn't apply anymore, but back about 7 years ago when I really dug into materialized views, I found that query rewrite required the older syntax. Always thought that was a bit odd.

Tuesday, August 06, 2013

The world's most [x] open source database.

I'm working with both Postgres and MySQL at the moment, so I have documentation pages open for each. I just randomly noticed how similar their respective slogans are.

Postgres: "The world's most advanced open source database."
MySQL:  "The world's most popular open source database."

Telling, isn't it? :)

Thursday, April 25, 2013

Schema versioning in Postgres

Edit long after writing this post (4/2016): These days I use Liquibase! But let's keep the post below for posterity....

For a web application our team has been working on, I wrote a set of scripts that facilitate keeping our Postgres database schema up to date. Each developer has their own database instance, but I'm the guy in charge of managing schema changes. So, whenever the schema is changed, it's nice to have a version number for each database so we can tell if an upgrade is required. (A similar approach could easily be used for databases other than Postgres, by the way.)

How it works:
We have a set of .sql scripts named in the pattern upgrade_[a.b]_to_[c.d].sql.
Each script contains the necessary changes, typically ALTER TABLE statements and such. At the very end of each, the schema version is set in the form of two function definitions: get_schema_version_major() and get_schema_version_minor(). Each simply returns an int. I increment the major number in any sql script that makes a change the application would break without - an example might be a dropped column. The minor version is incremented when a non-application-breaking change is made, such as a new constraint or index.

For each sql script there is a corresponding bash shell script as well as a Windows cmd script. Example: upgrade_[a.b]_to_[c.d].sh. This wraps up the call to psql so that the less-database-inclined can run it without hassle.

After a while of my coworkers using these upgrade scripts, there were complaints about how many scripts had to be run! For example, maybe schema version 10.3 needed to be upgraded to 15.0 via several links in between. So I cooked up a new shell script that intelligently traverses the chain of version numbers and automagically does it all for you. Just run it and voila! Here's what the output looks like:

Beginning schema upgrades, starting with version 10.3...
Upgrading from 10.3 to 11.0...
Upgrade successful.
Upgrading from 11.0 to 12.0...
Upgrade successful.
Upgrading from 12.0 to 13.0...
Upgrade successful.
Upgrading from 13.0 to 14.0...
Upgrade successful.
Upgrading from 14.0 to 15.0...
Upgrade successful.
Number of successful upgrades: 5. Version is now 15.0.

It won’t do anything it shouldn’t if I run it again…

mwrynn@mwrynnix:~$ ./

Beginning schema upgrades, starting with version 15.0...
Number of successful upgrades: 0. Version is now 15.0.

Here’s an error case. I started over from 10.3 and purposely added a syntax error to the 13.0_to_14.0 script just for the purpose of this test…

mwrynn@mwrynnix:~$ ./

Beginning schema upgrades, starting with version 10.3...
Upgrading from 10.3 to 11.0...
Upgrade successful.
Upgrading from 11.0 to 12.0...
Upgrade successful.
Upgrading from 12.0 to 13.0...
Upgrade successful.
Upgrading from 13.0 to 14.0...                         
psql:sbo_upgrade_13.0_to_14.0.sql:25: ERROR:  syntax error at or near "INT"
LINE 15: ...subscription ADD COLUMNBLAH pending_status_change INT REFERE...
Upgrade failed. Aborting.
Number of successful upgrades: 3. Version is now 13.0.

Finally, I’ve fixed that error and I’m resuming from where we were. Since each upgrade is an all-or-nothing transaction, it doesn’t matter that some of the statements in that bad script worked before the error occurred…

mwrynn@mwrynnix:~$ ./
Beginning schema upgrades, starting with version 13.0...
Upgrading from 13.0 to 14.0...
Upgrade successful.                                
Upgrading from 14.0 to 15.0...
Upgrade successful.
Number of successful upgrades: 2. Version is now 15.0.
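The chain-traversal logic at the heart of that script is simple enough to sketch. Here's a hypothetical Python version (the real thing is a shell script wrapping psql; the file names below are made up to match the upgrade_[a.b]_to_[c.d].sql pattern):

```python
import os
import re
import tempfile

def plan_upgrades(current, target, script_dir):
    """Walk the chain of upgrade_[a.b]_to_[c.d].sql scripts from
    version `current` to version `target`, returning them in run order."""
    pat = re.compile(r"upgrade_(\d+\.\d+)_to_(\d+\.\d+)\.sql$")
    # map each starting version to (script name, version it upgrades to)
    links = {}
    for name in os.listdir(script_dir):
        m = pat.match(name)
        if m:
            links[m.group(1)] = (name, m.group(2))
    plan, version = [], current
    while version != target:
        if version not in links:
            raise RuntimeError("no upgrade script starts at version " + version)
        script, version = links[version]
        plan.append(script)
    return plan

# demo with a hypothetical chain of scripts, 10.3 up through 15.0
d = tempfile.mkdtemp()
for a, b in [("10.3", "11.0"), ("11.0", "12.0"), ("12.0", "13.0"),
             ("13.0", "14.0"), ("14.0", "15.0")]:
    open(os.path.join(d, "upgrade_%s_to_%s.sql" % (a, b)), "w").close()

print(plan_upgrades("10.3", "15.0", d))  # five scripts, in chain order
print(plan_upgrades("15.0", "15.0", d))  # already up to date: empty plan
```

Running each planned script in its own all-or-nothing transaction is what makes the resume-after-failure behavior shown above work.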

Friday, March 15, 2013

No ORDER BY = No guarantee of order, period.

People won't seem to take Tom Kyte's word (and the Oracle docs' word) that if you don't use ORDER BY, results are not guaranteed to be returned in order. It's almost funny...

Thursday, March 14, 2013

OpenJPA headaches

I'm met with some incredulity every time I comment that fiddling with/working around the problems with O/R Mapper xyz wastes more of my time than it would take for me to handcode all my SQL. Here's another experience that makes me say just that.

Once upon a time, our Spring application used OpenJPA 2.1. The database was Postgres. Everything was happy until we encountered this bug, which broke some of our app's queries: ... So, we upgraded to OpenJPA 2.2, which includes the fix.

Along with that fix, we got a brand new wonderful feature that we didn't ask for, that can't be disabled! This feature is ID caching, via a new allocationSize parameter (to the @SequenceGenerator annotation), which caches [allocationSize] number of IDs on the application side. 

OpenJPA sends ALTER SEQUENCE commands to the database in order to update the sequence appropriately - this is necessary for its caching scheme to work. They decided this new parameter should default to 50, so that's what we were using. You see, when we upgraded to OpenJPA 2.2, we neglected to account for new features that might attempt to screw around with database objects behind our backs. The problem is that ALTER SEQUENCE can only be run by the sequence's owner; our application user is not the owner, so an error is thrown. The OpenJPA docs say you can set allocationSize to 1 to avoid using this feature, but when we tried that, it instead sent a command with completely invalid syntax: "ALTER SEQUENCE mytable_id_seq"!

This invalid statement is fixed in OpenJPA 2.3, which has not been released at this time. We tested a SNAPSHOT build of 2.3 and found that while it does correct the invalid syntax, it's still a problem: it sends an ALTER SEQUENCE statement when we don't want it to send any. Can't we just turn this thing off? Well, OpenJPA now ignores the failure of this particular ALTER SEQUENCE statement and just logs a warning. That's great and all, but:
1) Now I have unnecessary "ERROR"s in my Postgres log. I monitor the log for "ERROR", and now this junk is cluttering it.
2) One of my tables requires a special lock before modifying, so before we persist on the Java side, we begin a new transaction. The failed ALTER SEQUENCE statement therefore winds up in the same transaction as the UPDATE/INSERT. Postgres doesn't like it when there's an error in your transaction - it forces all subsequent statements to also throw errors, and the only option is to roll back. The end result: we can't modify the data at all. It doesn't matter that the application pretends the first statement wasn't an error.

We worked around the issue by writing our own custom sequence class that subclasses AbstractJDBCSeq. We tell it to do us a favor and not cache IDs in a way that won't work for us.

These ORMs tend to operate with the assumption that:
1) The ORM has total control over every database object
2) I actually want it to have total control over every database object

You know, in the past I've worked with database clusters, each node replicated, where the databases need control over the sequences. For example, imagine such a cluster with three nodes. In order to avoid ID collisions, Node 1 might use IDs 1, 4, 7, ... Node 2 uses 2, 5, 8, ... and finally Node 3 uses 3, 6, 9, ... (This kind of configuration is not the norm, but it isn't all that uncommon either.) Now if you decide it's OK to start adding numbers to these sequences without my permission, well, there goes the neighborhood.
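That interleaved-sequence scheme is easy to see in miniature. A quick Python sketch mimicking the START WITH / INCREMENT BY behavior of a database sequence (the three-node setup is the hypothetical one described above):

```python
from itertools import islice

def node_sequence(start_with, increment_by):
    """Mimics CREATE SEQUENCE ... START WITH start_with INCREMENT BY increment_by."""
    n = start_with
    while True:
        yield n
        n += increment_by

# three replicated nodes, each owning every third ID
ids = {node: list(islice(node_sequence(node, 3), 4)) for node in (1, 2, 3)}
print(ids)  # {1: [1, 4, 7, 10], 2: [2, 5, 8, 11], 3: [3, 6, 9, 12]}
```

No two nodes can ever hand out the same ID - unless, of course, something like an ORM starts issuing its own ALTER SEQUENCE statements and changes the increment out from under you.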

If you're going to introduce a feature that requires a new set of database privileges, or does anything non-standard with database objects, at least make it optional!


Oracle 12c New Features

The least fancy of these features are the ones I'm most excited to see. I should say, I'm most relieved to see. These deal with some annoyances that felt absurd. For example I'd find myself screaming, "Why in this day and age do I still have to create a trigger for every ID?!?"

Here they are (these are quotes from the page I linked to):

2. VARCHAR2 length up to 32767
This one will be one of the best features for developers who always struggle
to manage large chunks of data. Current versions of the database allow only
up to 4000 bytes in a single varchar2 cell, so developers have to use either
CLOB or XML data types, which are comparatively slower than varchar2.

3. Default value can reference sequences
This is also for developers who struggle to maintain unique values in
Primary Key columns. When creating a table, a column default can
reference sequence.nextval.

8. Boolean in SQL
As of Oracle 11g, Boolean is not a supported data type in SQL; in 12c you can
enjoy this feature.