[TriLUG] Clusters, performance, etc...

Michael Alan Dorman mdorman at debian.org
Mon Nov 7 21:46:41 EST 2005


Mark Freeze <mfreeze at gmail.com> writes:

> You guys are way ahead of me on some of the hardware questions... However,
> to try and answer some of them:
>  I have a script that controls the following actions:
>  1. Runs a c++ program that I wrote that opens a text file (the 50 - 100 MB
> file that I mentioned), reads each line sequentially and splits the data
> into two output files after performing numerous tasks to the data. (e.g.
> checking the validity of the zip code, making sure it matches the state,
> calculating amounts due, etc.)
>  2. Makes the second file into a dbase file
>  3. Runs another c++ program on the first file that examines each record in
> the file and compares it to another database (using proprietary code
> libraries supplied by our software vendor) that corrects any bad info in the
> address, adds a zip+4, adds carrier route info, etc...
>  4. Looks for another text file to process
>  5. Appends all processed text files together
>  6. Appends all dbase files into one
>  As I said in my previous post, each 100MB text file takes about 1 hr to
> run. Most of this time is spent on step 3.
>  So, would clustering speed up this sometimes 3 - 4 hr process?

It certainly sounds like your process could benefit from running
pieces in parallel, probably a lot, though how much depends in part on
exactly what your current bottleneck is.

The simplest solution, and one that seems well suited to your
workflow, would be a queueing system.  As a file comes in, it is
placed in the queue; idle machines periodically check the queue for
new work and run a job when they find one.
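
To make the idea concrete, here is a minimal sketch of such a queue
built on a shared directory.  Everything in it is an assumption for
illustration: the incoming/ and claimed/ directory names, the
enqueue() and claim_next() helpers, and the worker naming are all
hypothetical.  It relies on rename being atomic, which holds on a
local POSIX filesystem; on a network filesystem you would want to
verify that before trusting it.

```python
import os
from pathlib import Path
from typing import Optional


def enqueue(spool: Path, name: str, data: str) -> Path:
    """Place a new job file in the (hypothetical) incoming/ directory."""
    incoming = spool / "incoming"
    incoming.mkdir(parents=True, exist_ok=True)
    # Write under a hidden temporary name first, then rename, so that
    # workers never pick up a partially written file.
    tmp = incoming / f".{name}.tmp"
    tmp.write_text(data)
    final = incoming / name
    os.rename(tmp, final)
    return final


def claim_next(spool: Path, worker: str) -> Optional[Path]:
    """Try to claim one queued job; return its new path, or None if idle."""
    incoming = spool / "incoming"
    claimed = spool / "claimed"
    claimed.mkdir(parents=True, exist_ok=True)
    if not incoming.is_dir():
        return None
    for job in sorted(incoming.iterdir()):
        if job.name.startswith("."):
            continue  # still being written by enqueue()
        target = claimed / f"{worker}.{job.name}"
        try:
            # Only one worker can win this rename; the losers see
            # FileNotFoundError and move on to the next job.
            os.rename(job, target)
        except FileNotFoundError:
            continue
        return target
    return None
```

Each idle machine would call claim_next() in a loop, sleeping for a
few seconds whenever it returns None, and run the existing processing
script on whatever path it gets back.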

If your files arrive infrequently, a queue of whole files would leave
most machines idle, so you would probably need to break up each file
into multiple parts and queue those to be worked on.
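
A rough sketch of the splitting step follows.  It assumes one record
per line and that records can be processed independently of each
other, which is what your description suggests but which you would
need to confirm.  The split_by_lines() name and the ".partNNNN" file
naming are made up for the example.

```python
from pathlib import Path
from typing import List


def split_by_lines(src: Path, out_dir: Path, lines_per_part: int) -> List[Path]:
    """Split src into numbered parts of at most lines_per_part lines each."""
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: List[Path] = []
    out = None
    # Binary mode copies the bytes unchanged, so the encoding of the
    # input file does not matter here.
    with src.open("rb") as infile:
        for i, line in enumerate(infile):
            if i % lines_per_part == 0:
                # Start a new part every lines_per_part lines.
                if out is not None:
                    out.close()
                part = out_dir / f"{src.stem}.part{len(parts):04d}{src.suffix}"
                out = part.open("wb")
                parts.append(part)
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

Because the parts are numbered in order, concatenating them gives back
the original file, so the append steps at the end of your script can
put the processed pieces together in the same order.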

There are systems out there that already generalize this idea, for
instance GnuQueue (http://savannah.gnu.org/projects/gnu-queue/), but
writing your own specialized manager isn't beyond the pale.

Mike
-- 
I'm leaving America, I'm taking a girl -- David Sylvian


