[TriLUG] Clusters, performance, etc...

Mon Nov 7 22:02:31 EST 2005

Mark Freeze wrote:
> You guys are way ahead of me on some of the hardware questions... However,
> to try and answer some of them:
>  I have a script that controls the following actions:
>  1. Runs a C++ program that I wrote that opens a text file (the 50 - 100 MB
> file that I mentioned), reads each line sequentially and splits the data
> into two output files after performing numerous tasks on the data (e.g.
> checking the validity of the zip code, making sure it matches the state,
> calculating amounts due, etc.).
>  2. Makes the second file into a dbase file
>  3. Runs another C++ program on the first file that examines each record in
> the file and compares it to another database (using proprietary code
> libraries supplied by our software vendor) that corrects any bad info in the
> address, adds a zip+4, adds carrier route info, etc.
>  4. Looks for another text file to process
>  5. Appends all processed text files together
>  6. Appends all dbase files into one
>  As I said in my previous post, each 100MB text file takes about 1 hr to
> run. Most of this time is spent on step 3.
>  So, would clustering speed up this sometimes 3 - 4 hr process?
>  Thanks,
> Mark.

How much disk space does your whole process use while it is running?
It sounds like it might be only a couple of hundred megabytes. In that
case, try building a RAM disk and running the whole process from that
RAM disk.
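
As a rough sketch of that idea, the Python below measures how much space
a working directory uses and, if it fits, copies it into a RAM-backed
directory. Everything named here is an assumption for illustration: the
helper names tree_size_bytes and stage_in_ram are made up, /dev/shm is
assumed to be a tmpfs mount (common on Linux, but check your system), and
the 2x headroom factor is only a guess at how much the output files grow.

```python
import os
import shutil
import tempfile


def tree_size_bytes(root):
    """Return the total size in bytes of all regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so linked files are not counted.
            if os.path.isfile(path) and not os.path.islink(path):
                total += os.path.getsize(path)
    return total


def stage_in_ram(work_dir, ram_root="/dev/shm", headroom=2.0):
    """Copy work_dir into a new directory under ram_root; return the copy's path.

    ram_root is assumed to be RAM-backed (for example a tmpfs mount).
    headroom is a hypothetical safety factor for files the job creates.
    """
    needed = int(tree_size_bytes(work_dir) * headroom)
    free = shutil.disk_usage(ram_root).free
    if needed > free:
        raise RuntimeError(
            "need about %d bytes but only %d are free in %s"
            % (needed, free, ram_root)
        )
    staging = tempfile.mkdtemp(prefix="job-", dir=ram_root)
    dest = os.path.join(staging, "work")
    shutil.copytree(work_dir, dest)
    return dest
```

You would then point the script at the returned directory and copy the
finished files back to a real disk afterwards, since anything left on a
RAM disk is lost at reboot.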

I've run a lot of database-intensive jobs this way and gotten speed
increases of 10x by moving everything directly into RAM. At that rate
a 3-hour job would take about 18 minutes. That is likely a much better
speed increase than you would get from parallelization.
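
The 18-minute figure is just 180 minutes divided by 10, which assumes
the whole job is bound by disk I/O. A small sketch of the arithmetic,
with a hypothetical helper (estimated_minutes) and a made-up io_fraction
parameter for the share of the run time that the RAM disk actually
speeds up:

```python
def estimated_minutes(total_minutes, io_fraction, io_speedup):
    """Estimate run time when only the I/O-bound share of a job gets faster.

    io_fraction is the (assumed) share of total_minutes spent waiting on
    disk; io_speedup is how many times faster that share becomes.
    """
    if not 0.0 <= io_fraction <= 1.0:
        raise ValueError("io_fraction must be between 0 and 1")
    if io_speedup <= 0:
        raise ValueError("io_speedup must be positive")
    unaffected = total_minutes * (1.0 - io_fraction)
    accelerated = total_minutes * io_fraction / io_speedup
    return unaffected + accelerated


# Whole 3-hour job is I/O bound and gets the full 10x: 18.0 minutes.
best_case = estimated_minutes(180, 1.0, 10)

# Hypothetical case where only half the time is disk I/O: 99.0 minutes.
half_io_case = estimated_minutes(180, 0.5, 10)
```

If most of step 3 turns out to be CPU time inside the vendor libraries
rather than disk waits, the gain will be closer to the second number
than the first, so it is worth timing one file before committing.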

Good Luck!

Jon Carnes