Saturday, November 16, 2013

The Agony and the Ecstasy: Loading a 50G CSV file into an Oracle table Part I

Task: Load a 50+G CSV file into an Oracle database (running on Linux).

First problem: Only about 100G of free disk space was available on the Oracle server. Merely having a 50G CSV reside on the disk was taking up half! Is there a a way to import a compressed CSV (they are very-compressible)?

Enter named pipes.

Very briefly, named pipes allow for simple interprocess communication. I run one process and pipe output to the named pipe, which appears as a file in the filesystem. For my purpose, this allows me to unzip the huge CSV on the fly.

So how to set this up?
1) Upload mytable.csv.gz to the server running Oracle
2) Make a named pipe called mytable.csv
3) gunzip mytable.csv.gz and output to mytable.csv
4) From Oracle, load mytable.csv as if it were an ordinary CSV file. (I used an external table)

Now for some code so the above can be automated:

#!/bin/bash
#write_to_pipe.sh bash script:
#creates named pipe if non-existent and gunzips $1.gz to $1

pipe=$1

trap "rm -f $pipe" EXIT

if [[ ! -p $pipe ]]; then
    mkfifo $pipe
fi

gunzip -c $pipe.gz > $pipe


So, I run this script then query my external table that's hooked up to mytable.csv:

INSERT INTO mytable SELECT * FROM ext_mytable; --slightly simplified for illustrative purposes

...and voila! That’s it.

I hope someone else can enjoy this tip because I found it tremendously useful. By the way, the compressed CSV at 4.6G was less than 10% of the original CSV’s size! Nice…

I may write more of the hurdles and solutions involved in loading this big table into Oracle. But the named pipe solution was the neatest, coolest part, so I may leave it at that. :)

Postgres now supported on Amazon RDS

http://aws.amazon.com/rds/postgresql/

Finally!! Long story short time...

At work I had co-developed an application to run on Postgres - being the local database guy, I decided on Postgres as it is my preferred open source database. Fast forward about a year, and my boss said to me, "Mwrynn, it would really be nice to run your app on that RDS thing to make life easy for us when we go live. You know, RDS provides all those nice scalability, availability and management features." I agreed and converted the application to MySQL. This is not an automatic process just because we use a fancy ORM, and took a good deal of work. Then we went in production.

Now Postgres is available on RDS and I may have to convert back to Postgres again. The music running through my head: http://www.youtube.com/watch?v=1D5Sa2Yq-2g