Saturday, November 16, 2013

The Agony and the Ecstasy: Loading a 50G CSV file into an Oracle table Part I

Task: Load a 50+G CSV file into an Oracle database (running on Linux).

First problem: Only about 100G of free disk space was available on the Oracle server. Merely having the 50G CSV sit on disk would eat up half of it! Is there a way to import a compressed CSV instead (they are very compressible)?

Enter named pipes.

Very briefly, named pipes allow for simple interprocess communication: one process writes its output to the named pipe, which appears as a file in the filesystem, and another process reads from it. For my purposes, this lets me gunzip the huge CSV on the fly.
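If you've never used one, here's a quick illustration (the path and message are just placeholders):

mkfifo /tmp/demo.pipe             # the pipe shows up as a file
echo "hello" > /tmp/demo.pipe &   # the writer blocks until a reader opens the pipe
cat /tmp/demo.pipe                # the reader prints "hello"
rm /tmp/demo.pipe                 # remove the pipe like any ordinary file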

So how to set this up?
1) Upload mytable.csv.gz to the server running Oracle
2) Make a named pipe called mytable.csv
3) gunzip mytable.csv.gz and output to mytable.csv
4) From Oracle, load mytable.csv as if it were an ordinary CSV file. (I used an external table; a rough sketch of the definition is below)
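For step 4, the external table definition looked roughly like the sketch below. Treat it as an illustration only: the directory name, path, and column list are placeholders, not my real schema.

CREATE OR REPLACE DIRECTORY ext_dir AS '/data';  --directory holding mytable.csv (the named pipe)

CREATE TABLE ext_mytable (
  id   NUMBER,
  name VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY ext_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
    MISSING FIELD VALUES ARE NULL
  )
  LOCATION ('mytable.csv')
)
REJECT LIMIT UNLIMITED;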

Now for some code so the above can be automated:

#!/bin/bash
#write_to_pipe.sh bash script:
#creates the named pipe $1 if it doesn't exist and gunzips $1.gz into it
#(gunzip will block until something opens the pipe for reading)

pipe="$1"

#remove the pipe when the script exits
trap 'rm -f "$pipe"' EXIT

if [[ ! -p "$pipe" ]]; then
    mkfifo "$pipe"
fi

gunzip -c "$pipe.gz" > "$pipe"

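One gotcha: gunzip will just sit there until something opens the pipe for reading, so I kick the script off in the background (or from a second session) before starting the load. For example, assuming the files live under /data:

./write_to_pipe.sh /data/mytable.csv &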

With the script running, I query the external table that's hooked up to mytable.csv:

INSERT INTO mytable SELECT * FROM ext_mytable; --slightly simplified for illustrative purposes

...and voila! That’s it.

I hope someone else can enjoy this tip, because I found it tremendously useful. By the way, the compressed CSV, at 4.6G, was less than 10% of the original CSV’s size! Nice…

I may write more about the hurdles and solutions involved in loading this big table into Oracle. But the named pipe solution was the neatest, coolest part, so I may leave it at that. :)
