[TriLUG] Regexp Help

Wed Aug 13 01:20:37 EDT 2014

First off, William's solution is very clever and complete... but it strikes
me as vastly over-complicated.  Perhaps you both share some extra knowledge
about the contents of these logs that isn't evident in this post, but
wouldn't the following shell snippet also accomplish roughly the same goal?
$ egrep '2014$|ORA-' /tmp/text
Tue Mar 25 19:48:43 2014
Tue Mar 25 19:48:47 2014
Tue Mar 25 19:52:17 2014
ORA-12012: error on auto execute of job 216103
ORA-30926: unable to get a stable set of rows in the source tables
ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
ORA-06512: at line 2
Tue Mar 25 19:56:51 2014
Tue Mar 25 19:56:51 2014
Tue Mar 25 19:56:51 2014
$

Yes, the simple 'egrep' solution doesn't suppress the timestamp lines which
have no corresponding ORA- lines.  It sounds like you're still planning on
eyeballing the output, so perhaps that's okay?  If not, you can do one
better with just a touch more egrep:

$ egrep '2014$|^ORA-' /tmp/text | egrep -B1 ^ORA- | grep -v '^--$'
Tue Mar 25 19:52:17 2014
ORA-12012: error on auto execute of job 216103
ORA-30926: unable to get a stable set of rows in the source tables
ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
ORA-06512: at line 2
$

Sure, those regexes are a *lot* simpler than William's, but assuming that
the ^ORA- string is fairly unique, the second version should be pretty
unlikely to turn up a false match.  It's just grabbing the closest previous
line that looks like a timestamp.

The moral of the story that I'm telling is that if you look carefully at
most datasets, and know a couple of the more esoteric options of egrep (eg.
multiple matching with |, -A, -B, -C), you can usually get what you need
without resorting to heroic complication.

Aaron S. Joyner

Warning: This post goes really off the rails from here into my own personal
late-night optimizations, tinkering, and a smidge of ranting.  Proceed with
caution.

PS - As an academic exercise, I briefly cooked a parser in shell similar to
William's solution (it basically hinged on two tricks: [ ${line:0:4} ==
ORA- ] and storing the previous line in $line), but it's sufficiently
inelegant that I don't feel posting it here would really advance the
discussion.  Curious readers might wish to give that approach a try.

PPS - Aw, what the he.*.  I have a pet peeve with functional constructs
such as map and lambda, a reasonable percentage of people (usually, myself
included) don't find them intuitive.  Hence, my preference is to dump most
of the lines we don't need up front with egrep, then hold just two lines of
state in memory at a time:
$ egrep '2014$|ORA-' /tmp/text | while read line; do
if [ "${line:0:4}" == ORA- ]; then
  echo $last;
elif [ "${last:0:4}" == ORA- ]; then
  echo $last;
fi;
last="$line";
done

PPPS - So... naturally, how about runtime and memory usage comparison?  :-D
 Note: shell is a horrible text parsing language for many reasons, speed
not least among them.  Along the way, we'll see that simple regexes help
dramatically, and that it's really hard to beat a well-tuned C program like
egrep.
$ for i in `seq 1 1000`; do cat /tmp/text >> /tmp/hugetext; done
$ for i in `seq 1 100`; do cat /tmp/hugetext >> /tmp/text.100k.times; done
$ ls -lh /tmp/text.100k.times
-rw-r--r-- 1 asjoyner asjoyner 72M Aug 13 00:06 /tmp/text.100k.times$ wc -l
text.100k.times
1800000 text.100k.times
$ ln -s text.100k.times
 bar

$ /usr/bin/time egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time
egrep -B1 ^ORA- | /usr/bin/time grep -v '^--$' | wc -l
0.44user 0.10system 0:01.05elapsed 52%CPU (0avgtext+0avgdata
696maxresident)k
0inputs+0outputs (0major+221minor)pagefaults 0swaps
0.18user 0.05system 0:01.05elapsed 22%CPU (0avgtext+0avgdata
716maxresident)k
0inputs+0outputs (0major+226minor)pagefaults 0swaps
0.15user 0.03system 0:01.06elapsed 17%CPU (0avgtext+0avgdata
712maxresident)k
0inputs+0outputs (0major+225minor)pagefaults 0swaps
500000

$ /usr/bin/time ./omg.pl | wc -l
28.12user 0.71system 0:29.07elapsed 99%CPU (0avgtext+0avgdata
310008maxresident)k
0inputs+0outputs (0major+96785minor)pagefaults 0swaps
500000

$ /usr/bin/time egrep '2014$|ORA-' /tmp/text.100k.times | /usr/bin/time
/bin/bash -c '(while read line; do if [ "${line:0:4}" == ORA- ]; then
echo $last; elif [ "${last:0:4}" == ORA- ]; then   echo $last; fi;
last="$line"; done)' | wc -l
0.60user 0.12system 1:35.42elapsed 0%CPU (0avgtext+0avgdata 692maxresident)k
0inputs+0outputs (0major+220minor)pagefaults 0swaps
72.87user 17.83system 1:35.62elapsed 94%CPU (0avgtext+0avgdata
1068maxresident)k
0inputs+0outputs (0major+539minor)pagefaults 0swaps
500000

The perl solution takes about 29 seconds, the grep-based solution takes
about 1 second, the shell solution takes about 1m40s.  The above runs are
just one sample of several, in various orders, to rule out disk caching and
the other randomness associated with timesharing systems.

Note that the map()-based perl solution uses ~300MB of RAM to process a
72MB log file, because it slurps the entire file into a single variable.
 That's not perl's fault, persay, just an artifact of reading in the whole
file at once (and making a couple copies of it along the way), rather than
processing it line by line.  By comparison the egrep/grep-based solution
uses about 2MB of ram, and the combined egrep/bash solution weighs in at
just under 2MB.

PPPPS - Why can't I stop doing this?? I really should sleep... but if you
transliterate my bash parser into perl, with the looser regexes, you get
something both light weight on ram, and still pretty fast, though honestly
not nearly as fast as I expected, it still takes 20 seconds:
$ /usr/bin/time perl -ne 'if (/^ORA-|2014$/) { if (/^ORA-/) { print $last;
} elsif ($last =~ /^ORA-/) { print $last; } $last = $_; }'
/tmp/text.100k.times | wc -l
20.22user 0.10system 0:20.50elapsed 99%CPU (0avgtext+0avgdata
1596maxresident)k
0inputs+0outputs (0major+453minor)pagefaults 0swaps
500000

... there are just too many repetitive regex matches... let's do something
more pearly if we're going to use perl...

$ /usr/bin/time perl -ne 'if (/2014$/) { $ts = $_ } if (/^ORA-/) { if ($ts)
{ print $ts } print; undef $ts; }' /tmp/text.100k.times | wc -l
2.03user 0.08system 0:02.27elapsed 93%CPU (0avgtext+0avgdata
1592maxresident)k
0inputs+0outputs (0major+452minor)pagefaults 0swaps
500000

There we go!  ~1.5MB of ram and about 2 seconds... can we use the same
algorithm in bash?  how will it fare?

$ cat /tmp/text.100k.times | /usr/bin/time /bin/bash -c 'while read line;
do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi; if [ "${line:0:4}"
== ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo $line; ts="";
fi; done' | wc -l
139.51user 38.58system 3:05.02elapsed 96%CPU (0avgtext+0avgdata
2744maxresident)k
0inputs+0outputs (0major+736minor)pagefaults 0swaps
500000

Oh my, what have I done?  Run time is up north of 3 minutes... oh, I
dropped the egrep, and now bash is hitting every line, and as discussed,
it's awful at .. well basically everything.  Let's slip the egrep back in
there and see how it fares.  It really should be more respectable than
that...

$ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi;
if [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi;
echo $line; ts=""; fi; done' | wc -l
86.18user 19.46system 1:50.62elapsed 95%CPU (0avgtext+0avgdata
1176maxresident)k
0inputs+0outputs (0major+344minor)pagefaults 0swaps
500000

It's still high, so I stare at it again, and notice that I inadvertently
dropped the elif into just an if, meaning it's checking the first 4 chars
of every timestamp line to see if it's an ORA line.. lets not do that...

$ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; elif
[ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo
$line; ts=""; fi; done' | wc -l
72.03user 18.48system 1:35.68elapsed 94%CPU (0avgtext+0avgdata
1180maxresident)k
0inputs+0outputs (0major+345minor)pagefaults 0swaps
500000

Okay, that'll have to do.  1m35s, but it's really light on ram!  :-D

Time to sleep...