[TriLUG] Awk question

Wed Aug 8 04:24:04 EDT 2007

Mark Freeze wrote:
> 
> The tab-delimited file has the following fields:
> 
> LCode, Packet, AccountNum, CustID, Balance.
> 
> Example: (Using - for a delimiter instead of tabs)
> RXX - 0115 - 12345 - 15077 - 20.50
> RZA - 0115 - 12345 - 15077 - 35.00
> RET - 0117 - 54321 - 19004 - 70.23
> RET - 0117 - 54322 - 19100 - 20.00
> RET - 0117 - 54324 - 19700 - 115.99
> RET - 0117 - 54323 - 19233 - 78.50
> RAB - 0119 - 28033 - 21001 - 18.78
> BFR - 0110 - 78745 - 13701 - 5.00
> BFR - 0110 - 78749 - 19720 - 1.00
> BFR - 0110 - 78746 - 14500 - 8.23
> BFR - 0110 - 78748 - 45004 - 100.00
> 
> 
> I need to summarize the balance for the groups of records that have
> the same LCode and Packet, place the summarized total on the balance
> field of the record containing the highest AccountNum, and output only
> that record for the group. (And single records...) That operation
> would make a file that looked like this:
> 
> RXX - 0115 - 12345 - 15077 - 20.50
> RZA - 0115 - 12345 - 15077 - 35.00
> RET - 0117 - 54324 - 19700 - 284.72
> RAB - 0119 - 28033 - 21001 - 18.78
> BFR - 0110 - 78749 - 19720 - 114.23
> 

Then later, Mark wrote, in reference to processing the data via SQL instead:

 > Delete all records with the same LCode, AccountNum, and CustID for a
 > set of records. (Not the whole database, just the set of text records
 > that you are importing.) Would you do it as a query after import or
 > upon import.

This seems to be self-contradicting.  Why do you say you want to 
"delete" the records with the same code/account/custid ?   Earlier you 
said you are actually summing data with the same Lcode/Packet, placing 
this on a data row that matches with the one with the highest 
"accountnum", and also outputting the CustID that matches this record 
with the highest accountnum.

But based on your original description, here's what I'd do in SQL:
First, I'll simplify it by noting that logically you are treating Lcode 
and Packet as one entity, so I'm just going to pretend that's one 
column, and my data table would have the following (adding ID as a 
database-generated primary key, which isn't really needed but is a good 
habit):

ID, LcodePacket, AccountNum, CustID, Balance

First, let's do summarization and ignore the requirement to match up 
with the accoutNum/CustID.  This is easy:

SELECT     LcodePacket, SUM(Balance) as BalanceSum
FROM         theTable
GROUP BY LcodePacket;

The next step is to develop a query that maps the "Lcodepacket" to the 
account number/customer number that has the highest account number. 
This took some thinking but I think the easiest way is with a 
"correlated sub-query" as follows:

SELECT b.lcodepacket, b.accountnum, b.custid
FROM theTable b
WHERE b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE 
c.lcodepacket=b.lcodepacket)

This makes the assumption that a given CustID only appears with a given 
accountnum/lcodepacket pair, which was the case in the sample data you gave.

Now, join the two queries together to get your final report.  This 
requires joining the table to itself again, and changing the WHERE from 
the 2nd query into a HAVING clause:

SELECT     b.Lcodepacket, b.accountnum, b.custid,
            SUM(a.Balance) as BalanceSum
FROM         theTable a
LEFT JOIN    theTable b
ON 	a.LcodePacket  = b.lcodepacket
GROUP BY b.accountnum,b.custid,b.lcodepacket
HAVING  b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE 
c.lcodepacket=b.lcodepacket)

I may have over-thought this, but I tested the above with your sample 
data and it gave me the right output, so I think this is correct.  Note 
- I tested with SQL Server as that is what I had in front of me, but 
MySQL does support correlated subqueries so the above should work okay.

The question is how well the MySQL optimizer handles this to make sure 
this query doesn't take too long in case of 40,000 rows, but I think it 
should be okay.  It will definitely be fine in either MS SQL or Oracle, 
which is where my experience is, but I don't know too much about MySQL's 
execution patterns.

I'd be curious to see what approach you were using that was taking 5 
hours to process.  If you're already doing the correlated subquery 
approach, well, I apologize for taking your time.  :-)

Hope this helps,

Jeremy Portzer