[TriLUG] Regexp Help

William Sutton william at trilug.org
Wed Aug 13 07:27:58 EDT 2014


I first tried the egrep idea last night before diving into Perl.  Part of 
that may be my bias towards Perl :-)

I came up with something in the neighborhood of cat |egrep, but ran into 
the multiple timestamps that Aaron noted.  I didn't think to do a 
secondary exclusionary grep.  It also occurred to me that I don't 
necessarily know what the other log contents are, that the blank line is, 
in fact, a separate blank line, and that with multiple records, it would 
likely still be there.

So, assuming the worst (that there might be some record constructs that 
weren't in the sample), I rewrote it in Perl to capture /exactly/ what was 
asked for and exclude anything else.

That said, and depending on your level of comfort with junk entries or 
extra spaces or whatever, Aaron's solution is definitely simpler and has 
the advantage of not having to be a stand-alone script.

William Sutton

On Wed, 13 Aug 2014, Aaron Joyner wrote:

> First off, William's solution is very clever and complete... but it strikes
> me as vastly over-complicated.  Perhaps you both share some extra knowledge
> about the contents of these logs that isn't evident in this post, but
> wouldn't the following shell snippet also accomplish roughly the same goal?
> $ egrep '2014$|ORA-' /tmp/text
> Tue Mar 25 19:48:43 2014
> Tue Mar 25 19:48:47 2014
> Tue Mar 25 19:52:17 2014
> ORA-12012: error on auto execute of job 216103
> ORA-30926: unable to get a stable set of rows in the source tables
> ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
> ORA-06512: at line 2
> Tue Mar 25 19:56:51 2014
> Tue Mar 25 19:56:51 2014
> Tue Mar 25 19:56:51 2014
> $
>
> Yes, the simple 'egrep' solution doesn't suppress the timestamp lines which
> have no corresponding ORA- lines.  It sounds like you're still planning on
> eyeballing the output, so perhaps that's okay?  If not, you can do one
> better with just a touch more egrep:
>
> $ egrep '2014$|^ORA-' /tmp/text | egrep -B1 ^ORA- | grep -v '^--$'
> Tue Mar 25 19:52:17 2014
> ORA-12012: error on auto execute of job 216103
> ORA-30926: unable to get a stable set of rows in the source tables
> ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
> ORA-06512: at line 2
> $
>
> Sure, those regexes are a *lot* simpler than William's, but assuming that
> the ^ORA- string is fairly unique, the second version should be pretty
> unlikely to turn up a false match.  It's just grabbing the closest previous
> line that looks like a timestamp.
>
> The moral of the story that I'm telling is that if you look carefully at
> most datasets, and know a couple of the more esoteric options of egrep (eg.
> multiple matching with |, -A, -B, -C), you can usually get what you need
> without resorting to heroic complication.
>
> Aaron S. Joyner
>
>
>
> Warning: This post goes really off the rails from here into my own personal
> late-night optimizations, tinkering, and a smidge of ranting.  Proceed with
> caution.
>
> PS - As an academic exercise, I briefly cooked a parser in shell similar to
> William's solution (it basically hinged on two tricks: [ ${line:0:4} ==
> ORA- ] and storing the previous line in $line), but it's sufficiently
> inelegant that I don't feel posting it here would really advance the
> discussion.  Curious readers might wish to give that approach a try.
>
> PPS - Aw, what the he.*.  I have a pet peeve with functional constructs
> such as map and lambda, a reasonable percentage of people (usually, myself
> included) don't find them intuitive.  Hence, my preference is to dump most
> of the lines we don't need up front with egrep, then hold just two lines of
> state in memory at a time:
> $ egrep '2014$|ORA-' /tmp/text | while read line; do
> if [ "${line:0:4}" == ORA- ]; then
>  echo $last;
> elif [ "${last:0:4}" == ORA- ]; then
>  echo $last;
> fi;
> last="$line";
> done
>
> PPPS - So... naturally, how about runtime and memory usage comparison?  :-D
> Note: shell is a horrible text parsing language for many reasons, speed
> not least among them.  Along the way, we'll see that simple regexes help
> dramatically, and that it's really hard to beat a well-tuned C program like
> egrep.
> $ for i in `seq 1 1000`; do cat /tmp/text >> /tmp/hugetext; done
> $ for i in `seq 1 100`; do cat /tmp/hugetext >> /tmp/text.100k.times; done
> $ ls -lh /tmp/text.100k.times
> -rw-r--r-- 1 asjoyner asjoyner 72M Aug 13 00:06 /tmp/text.100k.times$ wc -l
> text.100k.times
> 1800000 text.100k.times
> $ ln -s text.100k.times
> bar
>
> $ /usr/bin/time egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time
> egrep -B1 ^ORA- | /usr/bin/time grep -v '^--$' | wc -l
> 0.44user 0.10system 0:01.05elapsed 52%CPU (0avgtext+0avgdata
> 696maxresident)k
> 0inputs+0outputs (0major+221minor)pagefaults 0swaps
> 0.18user 0.05system 0:01.05elapsed 22%CPU (0avgtext+0avgdata
> 716maxresident)k
> 0inputs+0outputs (0major+226minor)pagefaults 0swaps
> 0.15user 0.03system 0:01.06elapsed 17%CPU (0avgtext+0avgdata
> 712maxresident)k
> 0inputs+0outputs (0major+225minor)pagefaults 0swaps
> 500000
>
> $ /usr/bin/time ./omg.pl | wc -l
> 28.12user 0.71system 0:29.07elapsed 99%CPU (0avgtext+0avgdata
> 310008maxresident)k
> 0inputs+0outputs (0major+96785minor)pagefaults 0swaps
> 500000
>
> $ /usr/bin/time egrep '2014$|ORA-' /tmp/text.100k.times | /usr/bin/time
> /bin/bash -c '(while read line; do if [ "${line:0:4}" == ORA- ]; then
> echo $last; elif [ "${last:0:4}" == ORA- ]; then   echo $last; fi;
> last="$line"; done)' | wc -l
> 0.60user 0.12system 1:35.42elapsed 0%CPU (0avgtext+0avgdata 692maxresident)k
> 0inputs+0outputs (0major+220minor)pagefaults 0swaps
> 72.87user 17.83system 1:35.62elapsed 94%CPU (0avgtext+0avgdata
> 1068maxresident)k
> 0inputs+0outputs (0major+539minor)pagefaults 0swaps
> 500000
>
> The perl solution takes about 29 seconds, the grep-based solution takes
> about 1 second, the shell solution takes about 1m40s.  The above runs are
> just one sample of several, in various orders, to rule out disk caching and
> the other randomness associated with timesharing systems.
>
> Note that the map()-based perl solution uses ~300MB of RAM to process a
> 72MB log file, because it slurps the entire file into a single variable.
> That's not perl's fault, persay, just an artifact of reading in the whole
> file at once (and making a couple copies of it along the way), rather than
> processing it line by line.  By comparison the egrep/grep-based solution
> uses about 2MB of ram, and the combined egrep/bash solution weighs in at
> just under 2MB.
>
> PPPPS - Why can't I stop doing this?? I really should sleep... but if you
> transliterate my bash parser into perl, with the looser regexes, you get
> something both light weight on ram, and still pretty fast, though honestly
> not nearly as fast as I expected, it still takes 20 seconds:
> $ /usr/bin/time perl -ne 'if (/^ORA-|2014$/) { if (/^ORA-/) { print $last;
> } elsif ($last =~ /^ORA-/) { print $last; } $last = $_; }'
> /tmp/text.100k.times | wc -l
> 20.22user 0.10system 0:20.50elapsed 99%CPU (0avgtext+0avgdata
> 1596maxresident)k
> 0inputs+0outputs (0major+453minor)pagefaults 0swaps
> 500000
>
> ... there are just too many repetitive regex matches... let's do something
> more pearly if we're going to use perl...
>
> $ /usr/bin/time perl -ne 'if (/2014$/) { $ts = $_ } if (/^ORA-/) { if ($ts)
> { print $ts } print; undef $ts; }' /tmp/text.100k.times | wc -l
> 2.03user 0.08system 0:02.27elapsed 93%CPU (0avgtext+0avgdata
> 1592maxresident)k
> 0inputs+0outputs (0major+452minor)pagefaults 0swaps
> 500000
>
> There we go!  ~1.5MB of ram and about 2 seconds... can we use the same
> algorithm in bash?  how will it fare?
>
> $ cat /tmp/text.100k.times | /usr/bin/time /bin/bash -c 'while read line;
> do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi; if [ "${line:0:4}"
> == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo $line; ts="";
> fi; done' | wc -l
> 139.51user 38.58system 3:05.02elapsed 96%CPU (0avgtext+0avgdata
> 2744maxresident)k
> 0inputs+0outputs (0major+736minor)pagefaults 0swaps
> 500000
>
> Oh my, what have I done?  Run time is up north of 3 minutes... oh, I
> dropped the egrep, and now bash is hitting every line, and as discussed,
> it's awful at .. well basically everything.  Let's slip the egrep back in
> there and see how it fares.  It really should be more respectable than
> that...
>
> $ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
> 'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi;
> if [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi;
> echo $line; ts=""; fi; done' | wc -l
> 86.18user 19.46system 1:50.62elapsed 95%CPU (0avgtext+0avgdata
> 1176maxresident)k
> 0inputs+0outputs (0major+344minor)pagefaults 0swaps
> 500000
>
> It's still high, so I stare at it again, and notice that I inadvertently
> dropped the elif into just an if, meaning it's checking the first 4 chars
> of every timestamp line to see if it's an ORA line.. lets not do that...
>
> $ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
> 'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; elif
> [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo
> $line; ts=""; fi; done' | wc -l
> 72.03user 18.48system 1:35.68elapsed 94%CPU (0avgtext+0avgdata
> 1180maxresident)k
> 0inputs+0outputs (0major+345minor)pagefaults 0swaps
> 500000
>
> Okay, that'll have to do.  1m35s, but it's really light on ram!  :-D
>
> Time to sleep...
> -- 
> This message was sent to: William <william at trilug.org>
> To unsubscribe, send a blank message to trilug-leave at trilug.org from that address.
> TriLUG mailing list : http://www.trilug.org/mailman/listinfo/trilug
> Unsubscribe or edit options on the web	: http://www.trilug.org/mailman/options/trilug/william%40trilug.org
> Welcome to TriLUG: http://trilug.org/welcome
>


More information about the TriLUG mailing list