[TriLUG] MSN bot is pounding my website...

Aaron S. Joyner aaron at joyner.ws
Thu Dec 9 17:10:24 EST 2004


gregbrown at mindspring.com wrote:

>for all hits
>cat access_log| awk '{ print $1 }' | sort | uniq -c | sort -gr | head
>  
>
Disclaimer: this adds nothing of value to the actual conversation at hand.

Just a style preference, but I prefer to use cut instead of awk, as it 
seems to be just the right tool for the particular job you're doing.  
In the example you give, the replacement cut command would be ... 
access_log | cut -f 1 -d' ' | sort...  It also seemed to me at first 
glance that awk, being a much more capable but correspondingly heavier 
tool, would probably be slower at the task.  I set up a bit of an 
artificial test to determine which one was more efficient.  I took a 
mail log with about 9 million lines in it, and cat'd it through each of 
the programs, throwing the output to /dev/null, and repeated the 
process three times to get a bit of an average.
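
For anyone who wants to check that the two pipelines really produce the 
same report, here is a minimal sketch.  The sample log lines and the 
function names top_hosts_awk and top_hosts_cut are made up for 
illustration; only the pipelines themselves come from this thread.

```shell
#!/bin/sh
# Hypothetical sample data standing in for a real access_log.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [09/Dec/2004:17:00:00 -0500] "GET / HTTP/1.1" 200 512
10.0.0.2 - - [09/Dec/2004:17:00:01 -0500] "GET /a HTTP/1.1" 200 128
10.0.0.1 - - [09/Dec/2004:17:00:02 -0500] "GET /b HTTP/1.1" 404 64
EOF

# First field via awk, then count and rank the values.
top_hosts_awk() {
    awk '{ print $1 }' "$1" | sort | uniq -c | sort -gr | head
}

# First field via cut, using a single space as the delimiter.
top_hosts_cut() {
    cut -d' ' -f1 "$1" | sort | uniq -c | sort -gr | head
}

top_hosts_cut "$log"

# Report whether both variants agree on this input.
if [ "$(top_hosts_awk "$log")" = "$(top_hosts_cut "$log")" ]; then
    echo "awk and cut agree"
fi

rm -f "$log"
```

One caveat: the two only agree when fields are separated by single 
spaces, as in a typical access log.  awk collapses runs of whitespace 
and ignores leading blanks, while cut treats every space as a delimiter.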

awk took about 54 seconds on average; cut took about 43.  awk spent 
about 25.5 seconds processing in user space for each run; cut spent 
about 6.5.  The gap of about 11 seconds in real time (and, by those 
figures, about 19 seconds in user time) clearly shows that awk is 
paying attention to the entire line when it reads it in, whereas cut 
shortcuts once it has achieved its goal of getting to the first space.  
The rest of the time is simply how slow the disks are.  :)  For 
comparison, it took an average of 38 seconds to do a "wc -l" of this 
file.
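
If you want to repeat this kind of measurement on your own logs, 
something along these lines should do.  It is only a sketch: the helper 
name time_runs and the generated sample file are my own inventions, it 
reads the file directly rather than through cat, and it reports whole 
seconds of wall-clock time only.  For the user/system split quoted 
above you would run each command under "time" instead.

```shell
#!/bin/sh
# Run a command three times, discard its output, and print the
# wall-clock seconds each run took (hypothetical helper).
time_runs() {
    for run in 1 2 3; do
        start=$(date +%s)
        "$@" > /dev/null
        end=$(date +%s)
        echo "run $run: $((end - start))s"
    done
}

# Small generated file standing in for a real multi-million-line log.
sample=$(mktemp)
awk 'BEGIN { for (i = 0; i < 1000; i++) print "host" i % 7, "more", "fields" }' > "$sample"

echo "cut:"
time_runs cut -d' ' -f1 "$sample"
echo "awk:"
time_runs awk '{ print $1 }' "$sample"

rm -f "$sample"
```

On a file this small every run will report 0s; the differences only 
show up on inputs large enough to take several seconds to read.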

So in short, even on really large inputs, it's not going to make more 
than 10-15 seconds' worth of difference.  But if you're an efficiency 
nut, or dealing with ridiculous data sets, hopefully I've added one 
more tool to your bag of text-mangling tricks.  :)

Aaron S. Joyner


