[TriLUG] Awk question

Wed Aug 8 07:58:34 EDT 2007

Regardless, using a RDBMS for this is overkill.

The data is already stored in a table - flat file.
The data would use and fit in a trivial amount of RAM.
He is only concerned with the processed output.
Processing is done only once and historical data is not needed (see point 1)

As to why MySQL takes so long, I can only suspect that whatever
processing code exists is very poorly implemented.  40k records, <30
bytes each, is trivial.  The fact that there is some stored procedure
to remove dupes instead of using a primary key or unique index to
prevent insertion in the first place says a lot.

If MySQL is really that broken, you could always move to PostgreSQL ;-)

-- 
Robert Dale

On 8/8/07, Jeremy Portzer <jeremyp at pobox.com> wrote:
> Mark Freeze wrote:
> >
> > The tab-delimited file has the following fields:
> >
> > LCode, Packet, AccountNum, CustID, Balance.
> >
> > Example: (Using - for a delimiter instead of tabs)
> > RXX - 0115 - 12345 - 15077 - 20.50
> > RZA - 0115 - 12345 - 15077 - 35.00
> > RET - 0117 - 54321 - 19004 - 70.23
> > RET - 0117 - 54322 - 19100 - 20.00
> > RET - 0117 - 54324 - 19700 - 115.99
> > RET - 0117 - 54323 - 19233 - 78.50
> > RAB - 0119 - 28033 - 21001 - 18.78
> > BFR - 0110 - 78745 - 13701 - 5.00
> > BFR - 0110 - 78749 - 19720 - 1.00
> > BFR - 0110 - 78746 - 14500 - 8.23
> > BFR - 0110 - 78748 - 45004 - 100.00
> >
> >
> > I need to summarize the balance for the groups of records that have
> > the same LCode and Packet, place the summarized total on the balance
> > field of the record containing the highest AccountNum, and output only
> > that record for the group. (And single records...) That operation
> > would make a file that looked like this:
> >
> > RXX - 0115 - 12345 - 15077 - 20.50
> > RZA - 0115 - 12345 - 15077 - 35.00
> > RET - 0117 - 54324 - 19700 - 284.72
> > RAB - 0119 - 28033 - 21001 - 18.78
> > BFR - 0110 - 78749 - 19720 - 114.23
> >
>
> Then later, Mark wrote, in reference to processing the data via SQL instead:
>
>  > Delete all records with the same LCode, AccountNum, and CustID for a
>  > set of records. (Not the whole database, just the set of text records
>  > that you are importing.) Would you do it as a query after import or
>  > upon import.
>
> This seems to be self-contradicting.  Why do you say you want to
> "delete" the records with the same code/account/custid ?   Earlier you
> said you are actually summing data with the same Lcode/Packet, placing
> this on a data row that matches with the one with the highest
> "accountnum", and also outputting the CustID that matches this record
> with the highest accountnum.
>
> But based on your original description, here's what I'd do in SQL:
> First, I'll simplify it by noting that logically you are treating Lcode
> and Packet as one entity, so I'm just going to pretend that's one
> column, and my data table would have the following (adding ID as a
> database-generated primary key, which isn't really needed but is a good
> habit):
>
> ID, LcodePacket, AccountNum, CustID, Balance
>
> First, let's do summarization and ignore the requirement to match up
> with the accoutNum/CustID.  This is easy:
>
> SELECT     LcodePacket, SUM(Balance) as BalanceSum
> FROM         theTable
> GROUP BY LcodePacket;
>
> The next step is to develop a query that maps the "Lcodepacket" to the
> account number/customer number that has the highest account number.
> This took some thinking but I think the easiest way is with a
> "correlated sub-query" as follows:
>
> SELECT b.lcodepacket, b.accountnum, b.custid
> FROM theTable b
> WHERE b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE
> c.lcodepacket=b.lcodepacket)
>
> This makes the assumption that a given CustID only appears with a given
> accountnum/lcodepacket pair, which was the case in the sample data you gave.
>
> Now, join the two queries together to get your final report.  This
> requires joining the table to itself again, and changing the WHERE from
> the 2nd query into a HAVING clause:
>
> SELECT     b.Lcodepacket, b.accountnum, b.custid,
>             SUM(a.Balance) as BalanceSum
> FROM         theTable a
> LEFT JOIN    theTable b
> ON      a.LcodePacket  = b.lcodepacket
> GROUP BY b.accountnum,b.custid,b.lcodepacket
> HAVING  b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE
> c.lcodepacket=b.lcodepacket)
>
>
> I may have over-thought this, but I tested the above with your sample
> data and it gave me the right output, so I think this is correct.  Note
> - I tested with SQL Server as that is what I had in front of me, but
> MySQL does support correlated subqueries so the above should work okay.
>
> The question is how well the MySQL optimizer handles this to make sure
> this query doesn't take too long in case of 40,000 rows, but I think it
> should be okay.  It will definitely be fine in either MS SQL or Oracle,
> which is where my experience is, but I don't know too much about MySQL's
> execution patterns.
>
> I'd be curious to see what approach you were using that was taking 5
> hours to process.  If you're already doing the correlated subquery
> approach, well, I apologize for taking your time.  :-)
>
> Hope this helps,
>
> Jeremy Portzer
> --
> TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
> TriLUG Organizational FAQ  : http://trilug.org/faq/
> TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
>