[TriLUG] Awk question
Jeremy Portzer
jeremyp at pobox.com
Wed Aug 8 04:24:04 EDT 2007
Mark Freeze wrote:
>
> The tab-delimited file has the following fields:
>
> LCode, Packet, AccountNum, CustID, Balance.
>
> Example: (Using - for a delimiter instead of tabs)
> RXX - 0115 - 12345 - 15077 - 20.50
> RZA - 0115 - 12345 - 15077 - 35.00
> RET - 0117 - 54321 - 19004 - 70.23
> RET - 0117 - 54322 - 19100 - 20.00
> RET - 0117 - 54324 - 19700 - 115.99
> RET - 0117 - 54323 - 19233 - 78.50
> RAB - 0119 - 28033 - 21001 - 18.78
> BFR - 0110 - 78745 - 13701 - 5.00
> BFR - 0110 - 78749 - 19720 - 1.00
> BFR - 0110 - 78746 - 14500 - 8.23
> BFR - 0110 - 78748 - 45004 - 100.00
>
>
> I need to summarize the balance for the groups of records that have
> the same LCode and Packet, place the summarized total on the balance
> field of the record containing the highest AccountNum, and output only
> that record for the group. (And single records...) That operation
> would make a file that looked like this:
>
> RXX - 0115 - 12345 - 15077 - 20.50
> RZA - 0115 - 12345 - 15077 - 35.00
> RET - 0117 - 54324 - 19700 - 284.72
> RAB - 0119 - 28033 - 21001 - 18.78
> BFR - 0110 - 78749 - 19720 - 114.23
>
Then later, Mark wrote, in reference to processing the data via SQL instead:
> Delete all records with the same LCode, AccountNum, and CustID for a
> set of records. (Not the whole database, just the set of text records
> that you are importing.) Would you do it as a query after import or
> upon import.
This seems to be self-contradicting. Why do you say you want to
"delete" the records with the same code/account/custid ? Earlier you
said you are actually summing data with the same Lcode/Packet, placing
this on a data row that matches with the one with the highest
"accountnum", and also outputting the CustID that matches this record
with the highest accountnum.
But based on your original description, here's what I'd do in SQL:
First, I'll simplify it by noting that logically you are treating Lcode
and Packet as one entity, so I'm just going to pretend that's one
column, and my data table would have the following (adding ID as a
database-generated primary key, which isn't really needed but is a good
habit):
ID, LcodePacket, AccountNum, CustID, Balance
First, let's do summarization and ignore the requirement to match up
with the accoutNum/CustID. This is easy:
SELECT LcodePacket, SUM(Balance) as BalanceSum
FROM theTable
GROUP BY LcodePacket;
The next step is to develop a query that maps the "Lcodepacket" to the
account number/customer number that has the highest account number.
This took some thinking but I think the easiest way is with a
"correlated sub-query" as follows:
SELECT b.lcodepacket, b.accountnum, b.custid
FROM theTable b
WHERE b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE
c.lcodepacket=b.lcodepacket)
This makes the assumption that a given CustID only appears with a given
accountnum/lcodepacket pair, which was the case in the sample data you gave.
Now, join the two queries together to get your final report. This
requires joining the table to itself again, and changing the WHERE from
the 2nd query into a HAVING clause:
SELECT b.Lcodepacket, b.accountnum, b.custid,
SUM(a.Balance) as BalanceSum
FROM theTable a
LEFT JOIN theTable b
ON a.LcodePacket = b.lcodepacket
GROUP BY b.accountnum,b.custid,b.lcodepacket
HAVING b.accountnum = (SELECT MAX(accountnum) FROM theTable c WHERE
c.lcodepacket=b.lcodepacket)
I may have over-thought this, but I tested the above with your sample
data and it gave me the right output, so I think this is correct. Note
- I tested with SQL Server as that is what I had in front of me, but
MySQL does support correlated subqueries so the above should work okay.
The question is how well the MySQL optimizer handles this to make sure
this query doesn't take too long in case of 40,000 rows, but I think it
should be okay. It will definitely be fine in either MS SQL or Oracle,
which is where my experience is, but I don't know too much about MySQL's
execution patterns.
I'd be curious to see what approach you were using that was taking 5
hours to process. If you're already doing the correlated subquery
approach, well, I apologize for taking your time. :-)
Hope this helps,
Jeremy Portzer
More information about the TriLUG
mailing list