[TriLUG] Was: Awk question Now: Awk, Perl, SQL

Wed Aug 8 09:55:43 EDT 2007

Hi Jeremy,
Thanks for the SQL code.  We are going to give it a go this afternoon
to see how it improves performance.

Deleteing duplicates within the database is a seperate step that we
take before we summarize the packeted data. We delete the duplicate
records and then summarize the remaining data.  The reason that we
import the dups in the first place is because we also do some
calculation with them before we delete them. (We are now doing these
calculations in Perl before the SQL import.)

Thanks to everyone for the ideas.  (Especially Robert who just seems
interested in pointing out that we've written some bad code. Very
helpful Robert, very helpful...)

Also, I'd still be interested in seeing if anyone would like to throw
us a bone with an awk script to test.

Regards,
Mark.

On 8/8/07, Jeremy Portzer <jeremyp at pobox.com> wrote:
> Mark Freeze wrote:
> >
> > The tab-delimited file has the following fields:
> >
> > LCode, Packet, AccountNum, CustID, Balance.
> >
> > Example: (Using - for a delimiter instead of tabs)
> > RXX - 0115 - 12345 - 15077 - 20.50
> > RZA - 0115 - 12345 - 15077 - 35.00
> > RET - 0117 - 54321 - 19004 - 70.23
> > RET - 0117 - 54322 - 19100 - 20.00
> > RET - 0117 - 54324 - 19700 - 115.99
> > RET - 0117 - 54323 - 19233 - 78.50
> > RAB - 0119 - 28033 - 21001 - 18.78
> > BFR - 0110 - 78745 - 13701 - 5.00
> > BFR - 0110 - 78749 - 19720 - 1.00
> > BFR - 0110 - 78746 - 14500 - 8.23
> > BFR - 0110 - 78748 - 45004 - 100.00
> >
> >
> > I need to summarize the balance for the groups of records that have
> > the same LCode and Packet, place the summarized total on the balance
> > field of the record containing the highest AccountNum, and output only
> > that record for the group. (And single records...) That operation
> > would make a file that looked like this:
> >
> > RXX - 0115 - 12345 - 15077 - 20.50
> > RZA - 0115 - 12345 - 15077 - 35.00
> > RET - 0117 - 54324 - 19700 - 284.72
> > RAB - 0119 - 28033 - 21001 - 18.78
> > BFR - 0110 - 78749 - 19720 - 114.23
> >
>
> Then later, Mark wrote, in reference to processing the data via SQL instead:
>
>  > Delete all records with the same LCode, AccountNum, and CustID for a
>  > set of records. (Not the whole database, just the set of text records
>  > that you are importing.) Would you do it as a query after import or
>  > upon import.
>
> This seems to be self-contradicting.  Why do you say you want to
> "delete" the records with the same code/account/custid ?   Earlier you
> said you are actually summing data with the same Lcode/Packet, placing
> this on a data row that matches with the one with the highest
> "accountnum", and also outputting the CustID that matches this record
> with the highest accountnum.
>
> But based on your original description, here's what I'd do in SQL:
> First, I'll simplify it by noting that logically you are treating Lcode
> and Packet as one entity, so I'm just going to pretend that's one
> column, and my data table would have the following (adding ID as a
> database-generated primary key, which isn't really needed but is a good
> habit):
>
> ID, LcodePacket, AccountNum, CustID, Balance
>
> First, let's do summarization and ignore the requirement to match up
> with the accoutNum/CustID.  This is easy:
>
> SELECT     LcodePacket, SUM(Balance) as BalanceSum
> FROM         theTable
> GROUP BY LcodePacket;
>
> The next step is to develop a query that maps the "Lcodepacket" to the
> account number/customer number that has the highest account number.
> This took some thinking but I think the easiest way is with a
> "correlated sub-query" as follows:
>
> SELECT b.lcodepacket, b.accountnum, b.custid
> FROM theTable b
> WHERE b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE
> c.lcodepacket=b.lcodepacket)
>
> This makes the assumption that a given CustID only appears with a given
> accountnum/lcodepacket pair, which was the case in the sample data you gave.
>
> Now, join the two queries together to get your final report.  This
> requires joining the table to itself again, and changing the WHERE from
> the 2nd query into a HAVING clause:
>
> SELECT     b.Lcodepacket, b.accountnum, b.custid,
>            SUM(a.Balance) as BalanceSum
> FROM         theTable a
> LEFT JOIN    theTable b
> ON      a.LcodePacket  = b.lcodepacket
> GROUP BY b.accountnum,b.custid,b.lcodepacket
> HAVING  b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE
> c.lcodepacket=b.lcodepacket)
>
>
> I may have over-thought this, but I tested the above with your sample
> data and it gave me the right output, so I think this is correct.  Note
> - I tested with SQL Server as that is what I had in front of me, but
> MySQL does support correlated subqueries so the above should work okay.
>
> The question is how well the MySQL optimizer handles this to make sure
> this query doesn't take too long in case of 40,000 rows, but I think it
> should be okay.  It will definitely be fine in either MS SQL or Oracle,
> which is where my experience is, but I don't know too much about MySQL's
> execution patterns.
>
> I'd be curious to see what approach you were using that was taking 5
> hours to process.  If you're already doing the correlated subquery
> approach, well, I apologize for taking your time.  :-)
>
> Hope this helps,
>
> Jeremy Portzer
> --
> TriLUG mailing list        : http://www.trilug.org/mailman/listinfo/trilug
> TriLUG Organizational FAQ  : http://trilug.org/faq/
> TriLUG Member Services FAQ : http://members.trilug.org/services_faq/
>