[TriLUG] MySQL - Perl

Don Jerman djerman at pobox.com
Tue Jul 10 12:18:52 EDT 2007


Andrew suggests batching your data load to reduce line-by-line
overhead; William suggests improvements in the row-by-row process.
Both might help, but which one addresses the actual problem?

You need to start with a problem determination step.

With any performance problem, your first question should be "what
took so long?", not "do I need an index?".  I'd look at Time::HiRes
and a stats module to accumulate some statistics about each step in
the process - how much of your two hours was spent in each step?
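Something like this rough sketch would do it - the sub names are just
placeholders for whatever your script actually calls the steps you
described (import, delete_duplicates, summarize_accounts, export):

use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my %elapsed;

# Wrap each phase of the script so we can see where the two hours go.
sub timed {
    my ($label, $code) = @_;
    my $t0 = [gettimeofday];
    $code->();
    $elapsed{$label} += tv_interval($t0);
}

timed('import',            \&import_file);
timed('delete_duplicates', \&delete_duplicates);
timed('summarize',         \&summarize_accounts);
timed('export',            \&export_records);

printf "%-20s %8.2f s\n", $_, $elapsed{$_} for sort keys %elapsed;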

If it's all in one or two steps, then you can make a more intelligent
decision about why those steps might be slow.  If it's spread out like
peanut butter over all the steps, maybe you have a hardware issue (slow
disk?) or you need to look at the overall approach rather than
concentrate on any one thing.

Personally, I'm with Andrew - you're taking a one-at-a-time approach
to batch processing.  There should be no duplicates to start with, but
if you're going to have them, you should sort them out in one pass -
throw them all into a temp table and let the database do the sorting,
because that's what it's good at.
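For instance, something along these lines - just a sketch, and I'm
guessing at a table called "imports" with the columns you named, so
adjust the names to your actual schema:

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:mysql:database=mydb;host=localhost',
                       'user', 'password', { RaiseError => 1 });

# Pass 1: collapse exact duplicates (same batch_id, product_code,
# account_value) with one GROUP BY instead of row-by-row deletes.
# MAX(product_cost) assumes the duplicate rows carry the same cost.
$dbh->do(q{
    CREATE TEMPORARY TABLE dedup AS
    SELECT batch_id, product_code, account_value,
           MAX(product_cost) AS product_cost
      FROM imports
     GROUP BY batch_id, product_code, account_value
});

# Pass 2: consolidate each (batch_id, product_code) set - total the
# cost and keep the numerically highest account_value.
$dbh->do(q{
    CREATE TEMPORARY TABLE consolidated AS
    SELECT batch_id, product_code,
           MAX(account_value) AS account_value,
           SUM(product_cost)  AS product_cost
      FROM dedup
     GROUP BY batch_id, product_code
});

# "consolidated" now holds one row per set, ready to dump to the text
# file with a single SELECT - no per-row DELETEs or UPDATEs needed.

I'd expect two set-based statements like that to run in seconds on
15,000 rows, not hours.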

On 7/9/07, Mark Freeze <mfreeze at gmail.com> wrote:
> Hi All,
> I have a question regarding an issue we are having with Perl & MySQL
> and I was wondering if anyone wanted to take a stab at it.
>
> We have a Perl script that parses text files we receive daily (around
> 1K to 25K records) into a MySQL database.
>
> The queries we issue from Perl perform the following actions: (All
> queries are limited to the records we just imported.)
>
> * Delete duplicate records within the database that have the same
> batch_id, product_code, and account_value, and retain only one record.
>
> * Examine the remaining records for sets that have the same batch_id
> and product_code, total the product_cost field for the set, place that
> amount in the record with the numerically highest account_value, then
> delete all of the other records.
>
> * Export all of the remaining records to a text file.
>
> * Mark the records as 'processed' within the database.
>
> The problem is this:  With this exact scenario, we imported around
> 15,000 records today.  After deleting duplicates and consolidating
> records, we exported around 5,000.  Perl & MySQL took around 2 hrs to
> perform this task.
>
> Now I could have this all wrong; however, I feel that this operation
> should have taken far, FAR less time.  And, to compound things, our
> volume on this account is about to increase by around 150%.
>
> The delete_duplicates routine takes over an hour to run, and the
> summarize_accounts and export code takes over an hour.  The import is
> very quick.
>
> So, I was wondering if anyone would like to throw an idea out there as
> to what they believe the most efficient (read 'speedy') method would
> be to perform these tasks.
>
> Best Regards,
> Mark.


