[TriLUG] Regexp Help

Alan Sterger asterger at earthlink.net
Thu Aug 14 17:08:40 EDT 2014


And the winner is...

Thank you William, Daniel, Aaron and anyone else who silently pondered 
my question.

William - I didn't think to include perl, as perl and I parted company 
somewhere after v4 as the direction seemed to be to stuff the language 
full of libraries.  To paraphrase Bones, "Damit Jim, I'm a DBA not a 
developer".  I did try your perl example and it indeed worked exactly as 
I had wanted.  I may try using your "dirty trick" within sed with tr to 
accomplish the same.

Aaron - the egrep solution didn't work for me as the timestamps were no 
longer associated with relevant ORA- lines.

If this helps: I don't mind piping the file through multiple 
"one-liners".  Also there will always be one timestamp per block of ORA- 
lines but the number of ORA- lines is unknown.  Also the line after the 
last ORA- line may or may not be an empty line. Again the "Errors in 
file" line isn't important and can be filtered.

Sigh... used to be better with regexp back in the day.  Contrary to 
today, there were fewer ways to do it.  Just System V or BSD. Life was good.

Cheers,

-- Alan S.


On 8/13/2014 12:00 PM, trilug-request at trilug.org wrote:
> Hello Listers,
>
> Analyzing an Oracle alert_log.  Familiar with sed, grep and regular 
> expressions but processing across multiple lines has always been my 
> nemesis.  What I'm trying to accomplish is to printout blocks of lines 
> to a smaller file for analysis.
>
> Thanks for your help,
>
> - Alan S.
>
> Alert log fragment:
> ******************************************************************
> Tue Mar 25 19:48:43 2014
>   Current log# 3 seq# 141753 mem# 1: F:\ORADATA\NCP15\REDO03B.LOG
> Tue Mar 25 19:48:47 2014
> ARC0: Completed archiving  log 2 thread 1 sequence 141752
> Tue Mar 25 19:52:17 2014
> Errors in file d:\oracle\admin\ncp15\udump\ncp15_j000_5824.trc:
> ORA-12012: error on auto execute of job 216103
> ORA-30926: unable to get a stable set of rows in the source tables
> ORA-06512: at "ENTREF.BGNPOSTLOAD", line 156
> ORA-06512: at line 2
>
> Tue Mar 25 19:56:51 2014
> Thread 1 advanced to log sequence 141754
> Tue Mar 25 19:56:51 2014
> ARC0: Evaluating archive   log 3 thread 1 sequence 141753
> Tue Mar 25 19:56:51 2014
>   Current log# 1 seq# 141754 mem# 0: F:\ORADATA\NCP15\REDO01A.LOG
>   Current log# 1 seq# 141754 mem# 1: F:\ORADATA\NCP15\REDO01B.LOG
> ******************************************************************
>
> Preferred output:
> Tue Mar 25 19:52:17 2014
> ORA-12012: error on auto execute of job 216103
> ORA-30926: unable to get a stable set of rows in the source tables
> ORA-06512: at "ENTREF.BGNPOSTLOAD", line 156
> ORA-06512: at line 2
>
>
> Needed information:
> DOW & timestamp line
> all ORA- lines
>
> "Errors in file" line can be removed/skipped as well as the ORA- 
> terminating whitespace line.
>
>
>
>
>
>
>
>
>
>
>
>
>
>
>
> Re: [TriLUG] Regexp Help.eml
>
> Subject:
> Re: [TriLUG] Regexp Help
> From:
> William Sutton <william at trilug.org>
> Date:
> 8/12/2014 5:21 PM
>
> To:
> Triangle Linux Users Group General Discussion <trilug at trilug.org>
>
>
> Assuming you have a file named "bar", which you could easily change in 
> the following Perl, it should do the trick for you.  It requires no 
> special libraries and is thoroughly overdocumented so everyone can 
> understand what I just did.  I've done this particular trick so many 
> times, it's second nature.  Code follows:
>
> #---
>
> # read in the file
> open(my $fh, "bar");
> my $f;
> while (<$fh>)
> {
>     $f .= $_;
> }
> close($fh);
>
> # regex match the date/time stamp; we need this twice, so define it once
> my $dt_re = '([a-z]{3}\s+[a-z]{3}\s+\d+\s+\d+(?::\d{2}){2}\s+\d{4})';
>
> # dirty trick; replace the matched date/time stamp with a known 
> nonprintable
> # character (\x01, or hex 01; anything unique would do), followed by 
> the same
> # date/time stamp we found before; this marks the beginning of the record
> $f =~ s/$dt_re/\x01$1/gi;
>
> # now the nasty magic; see per-line notes, with numeric ordering so 
> you can
> # follow what is happening
> # 9. and then print the result
> print
>
>   # 7. re-join the saved blocks on a newline (results in a single 
> block of text)
>   join(
>     "\n",
>
>     # 3. process the blocks from step 2 by
>     map {
>
>         # 6. then re-join the matching block lines on a newline so 
> that it
>         #    is one text block again
>         join(
>             "\n",
>
>             # 5. then grep for only lines with the date/time stamp (the
>             #    other place we use the ugly regex) or ORA- lines
>             grep { m/$dt_re|ORA-/i }
>
>               # 4. splitting them on a newline (now we have individual 
> lines
>               #    instead of blocks)
>               split(/\n/, $_)
>             )
>       }
>
>       # 2. then, grep out only those blocks that contain ORA- lines; any
>       #    other text blocks fall out at this point
>       grep { m/ORA-/ }
>
>       # 1. split the modified file text on our arbitrary \x01; this 
> creates
>       #    blocks of text that start with a date/time stamp
>       split(/\x01/, $f)
>   )
>
>   # 8. then tack on a newline at the end for formatting purposes
>   . "\n";
>
> #---
>
> William Sutton
>
> On Tue, 12 Aug 2014, Alan Sterger wrote:
>
>> Hello Listers,
>>
>> Analyzing an Oracle alert_log.  Familiar with sed, grep and regular 
>> expressions but processing across multiple lines has always been my 
>> nemesis. What I'm trying to accomplish is to printout blocks of lines 
>> to a smaller file for analysis.
>>
>> Thanks for your help,
>>
>> - Alan S.
>>
>> Alert log fragment:
>> ******************************************************************
>> Tue Mar 25 19:48:43 2014
>>  Current log# 3 seq# 141753 mem# 1: F:\ORADATA\NCP15\REDO03B.LOG
>> Tue Mar 25 19:48:47 2014
>> ARC0: Completed archiving  log 2 thread 1 sequence 141752
>> Tue Mar 25 19:52:17 2014
>> Errors in file d:\oracle\admin\ncp15\udump\ncp15_j000_5824.trc:
>> ORA-12012: error on auto execute of job 216103
>> ORA-30926: unable to get a stable set of rows in the source tables
>> ORA-06512: at "ENTREF.BGNPOSTLOAD", line 156
>> ORA-06512: at line 2
>>
>> Tue Mar 25 19:56:51 2014
>> Thread 1 advanced to log sequence 141754
>> Tue Mar 25 19:56:51 2014
>> ARC0: Evaluating archive   log 3 thread 1 sequence 141753
>> Tue Mar 25 19:56:51 2014
>>  Current log# 1 seq# 141754 mem# 0: F:\ORADATA\NCP15\REDO01A.LOG
>>  Current log# 1 seq# 141754 mem# 1: F:\ORADATA\NCP15\REDO01B.LOG
>> ******************************************************************
>>
>> Preferred output:
>> Tue Mar 25 19:52:17 2014
>> ORA-12012: error on auto execute of job 216103
>> ORA-30926: unable to get a stable set of rows in the source tables
>> ORA-06512: at "ENTREF.BGNPOSTLOAD", line 156
>> ORA-06512: at line 2
>>
>>
>> Needed information:
>> DOW & timestamp line
>> all ORA- lines
>>
>> "Errors in file" line can be removed/skipped as well as the ORA- 
>> terminating whitespace line.
>>
>>
>>
>>
>>
>
> Re: [TriLUG] Regexp Help.eml
>
> Subject:
> Re: [TriLUG] Regexp Help
> From:
> Daniel Sterling <sterling.daniel at gmail.com>
> Date:
> 8/12/2014 5:39 PM
>
> To:
> Triangle Linux Users Group General Discussion <trilug at trilug.org>
>
>
> assuming this is a single-use script, you can just use:
>
> undef $/;
> $f = <>;
>
> instead of the open + while loop. <> automatically opens and reads
> from either STDIN or the files listed on the command line
>
> seehttp://perldoc.perl.org/functions/readline.html
>
> -- Dan
>
>
>
> Re: [TriLUG] Regexp Help.eml
>
> Subject:
> Re: [TriLUG] Regexp Help
> From:
> William Sutton <william at trilug.org>
> Date:
> 8/12/2014 5:48 PM
>
> To:
> Triangle Linux Users Group General Discussion <trilug at trilug.org>
>
>
> True.  I could also have just used File::Slurp and been done with it :-)
>
> BTW, for the non-Perl folks, this is a good example of TIMTOWTDI.  It 
> is also a fine example of using split/join/map/grep to do in-line list 
> creation, processing, and reduction.
>
> The general idea is "I have a single thing (a log file) and I want 
> another single thing (a cleaner log file)."  I could go through the 
> trouble of explicitly creating arrays, populating them, iterating over 
> them, and recombining them into scalars...
>
> ...or I could just munge it all at once and let Perl deal with the 
> data manipulations.
>
> William Sutton
>
> On Tue, 12 Aug 2014, Daniel Sterling wrote:
>
>
>
> Re: [TriLUG] Regexp Help.eml
>
> Subject:
> Re: [TriLUG] Regexp Help
> From:
> Aaron Joyner <aaron at joyner.ws>
> Date:
> 8/13/2014 1:20 AM
>
> To:
> Triangle Linux Users Group General Discussion <trilug at trilug.org>
>
>
> First off, William's solution is very clever and complete... but it strikes
> me as vastly over-complicated.  Perhaps you both share some extra knowledge
> about the contents of these logs that isn't evident in this post, but
> wouldn't the following shell snippet also accomplish roughly the same goal?
> $ egrep '2014$|ORA-' /tmp/text
> Tue Mar 25 19:48:43 2014
> Tue Mar 25 19:48:47 2014
> Tue Mar 25 19:52:17 2014
> ORA-12012: error on auto execute of job 216103
> ORA-30926: unable to get a stable set of rows in the source tables
> ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
> ORA-06512: at line 2
> Tue Mar 25 19:56:51 2014
> Tue Mar 25 19:56:51 2014
> Tue Mar 25 19:56:51 2014
> $
>
> Yes, the simple 'egrep' solution doesn't suppress the timestamp lines which
> have no corresponding ORA- lines.  It sounds like you're still planning on
> eyeballing the output, so perhaps that's okay?  If not, you can do one
> better with just a touch more egrep:
>
> $ egrep '2014$|^ORA-' /tmp/text | egrep -B1 ^ORA- | grep -v '^--$'
> Tue Mar 25 19:52:17 2014
> ORA-12012: error on auto execute of job 216103
> ORA-30926: unable to get a stable set of rows in the source tables
> ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
> ORA-06512: at line 2
> $
>
> Sure, those regexes are a*lot*  simpler than William's, but assuming that
> the ^ORA- string is fairly unique, the second version should be pretty
> unlikely to turn up a false match.  It's just grabbing the closest previous
> line that looks like a timestamp.
>
> The moral of the story that I'm telling is that if you look carefully at
> most datasets, and know a couple of the more esoteric options of egrep (eg.
> multiple matching with |, -A, -B, -C), you can usually get what you need
> without resorting to heroic complication.
>
> Aaron S. Joyner
>
>
>
> Warning: This post goes really off the rails from here into my own personal
> late-night optimizations, tinkering, and a smidge of ranting.  Proceed with
> caution.
>
> PS - As an academic exercise, I briefly cooked a parser in shell similar to
> William's solution (it basically hinged on two tricks: [ ${line:0:4} ==
> ORA- ] and storing the previous line in $line), but it's sufficiently
> inelegant that I don't feel posting it here would really advance the
> discussion.  Curious readers might wish to give that approach a try.
>
> PPS - Aw, what the he.*.  I have a pet peeve with functional constructs
> such as map and lambda, a reasonable percentage of people (usually, myself
> included) don't find them intuitive.  Hence, my preference is to dump most
> of the lines we don't need up front with egrep, then hold just two lines of
> state in memory at a time:
> $ egrep '2014$|ORA-' /tmp/text | while read line; do
> if [ "${line:0:4}" == ORA- ]; then
>    echo $last;
> elif [ "${last:0:4}" == ORA- ]; then
>    echo $last;
> fi;
> last="$line";
> done
>
> PPPS - So... naturally, how about runtime and memory usage comparison?:-D
>   Note: shell is a horrible text parsing language for many reasons, speed
> not least among them.  Along the way, we'll see that simple regexes help
> dramatically, and that it's really hard to beat a well-tuned C program like
> egrep.
> $ for i in `seq 1 1000`; do cat /tmp/text >> /tmp/hugetext; done
> $ for i in `seq 1 100`; do cat /tmp/hugetext >> /tmp/text.100k.times; done
> $ ls -lh /tmp/text.100k.times
> -rw-r--r-- 1 asjoyner asjoyner 72M Aug 13 00:06 /tmp/text.100k.times$ wc -l
> text.100k.times
> 1800000 text.100k.times
> $ ln -s text.100k.times
>   bar
>
> $ /usr/bin/time egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time
> egrep -B1 ^ORA- | /usr/bin/time grep -v '^--$' | wc -l
> 0.44user 0.10system 0:01.05elapsed 52%CPU (0avgtext+0avgdata
> 696maxresident)k
> 0inputs+0outputs (0major+221minor)pagefaults 0swaps
> 0.18user 0.05system 0:01.05elapsed 22%CPU (0avgtext+0avgdata
> 716maxresident)k
> 0inputs+0outputs (0major+226minor)pagefaults 0swaps
> 0.15user 0.03system 0:01.06elapsed 17%CPU (0avgtext+0avgdata
> 712maxresident)k
> 0inputs+0outputs (0major+225minor)pagefaults 0swaps
> 500000
>
> $ /usr/bin/time ./omg.pl | wc -l
> 28.12user 0.71system 0:29.07elapsed 99%CPU (0avgtext+0avgdata
> 310008maxresident)k
> 0inputs+0outputs (0major+96785minor)pagefaults 0swaps
> 500000
>
> $ /usr/bin/time egrep '2014$|ORA-' /tmp/text.100k.times | /usr/bin/time
> /bin/bash -c '(while read line; do if [ "${line:0:4}" == ORA- ]; then
> echo $last; elif [ "${last:0:4}" == ORA- ]; then   echo $last; fi;
> last="$line"; done)' | wc -l
> 0.60user 0.12system 1:35.42elapsed 0%CPU (0avgtext+0avgdata 692maxresident)k
> 0inputs+0outputs (0major+220minor)pagefaults 0swaps
> 72.87user 17.83system 1:35.62elapsed 94%CPU (0avgtext+0avgdata
> 1068maxresident)k
> 0inputs+0outputs (0major+539minor)pagefaults 0swaps
> 500000
>
> The perl solution takes about 29 seconds, the grep-based solution takes
> about 1 second, the shell solution takes about 1m40s.  The above runs are
> just one sample of several, in various orders, to rule out disk caching and
> the other randomness associated with timesharing systems.
>
> Note that the map()-based perl solution uses ~300MB of RAM to process a
> 72MB log file, because it slurps the entire file into a single variable.
>   That's not perl's fault, persay, just an artifact of reading in the whole
> file at once (and making a couple copies of it along the way), rather than
> processing it line by line.  By comparison the egrep/grep-based solution
> uses about 2MB of ram, and the combined egrep/bash solution weighs in at
> just under 2MB.
>
> PPPPS - Why can't I stop doing this?? I really should sleep... but if you
> transliterate my bash parser into perl, with the looser regexes, you get
> something both light weight on ram, and still pretty fast, though honestly
> not nearly as fast as I expected, it still takes 20 seconds:
> $ /usr/bin/time perl -ne 'if (/^ORA-|2014$/) { if (/^ORA-/) { print $last;
> } elsif ($last =~ /^ORA-/) { print $last; } $last = $_; }'
> /tmp/text.100k.times | wc -l
> 20.22user 0.10system 0:20.50elapsed 99%CPU (0avgtext+0avgdata
> 1596maxresident)k
> 0inputs+0outputs (0major+453minor)pagefaults 0swaps
> 500000
>
> ... there are just too many repetitive regex matches... let's do something
> more pearly if we're going to use perl...
>
> $ /usr/bin/time perl -ne 'if (/2014$/) { $ts = $_ } if (/^ORA-/) { if ($ts)
> { print $ts } print; undef $ts; }' /tmp/text.100k.times | wc -l
> 2.03user 0.08system 0:02.27elapsed 93%CPU (0avgtext+0avgdata
> 1592maxresident)k
> 0inputs+0outputs (0major+452minor)pagefaults 0swaps
> 500000
>
> There we go!  ~1.5MB of ram and about 2 seconds... can we use the same
> algorithm in bash?  how will it fare?
>
> $ cat /tmp/text.100k.times | /usr/bin/time /bin/bash -c 'while read line;
> do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi; if [ "${line:0:4}"
> == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo $line; ts="";
> fi; done' | wc -l
> 139.51user 38.58system 3:05.02elapsed 96%CPU (0avgtext+0avgdata
> 2744maxresident)k
> 0inputs+0outputs (0major+736minor)pagefaults 0swaps
> 500000
>
> Oh my, what have I done?  Run time is up north of 3 minutes... oh, I
> dropped the egrep, and now bash is hitting every line, and as discussed,
> it's awful at .. well basically everything.  Let's slip the egrep back in
> there and see how it fares.  It really should be more respectable than
> that...
>
> $ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
> 'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi;
> if [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi;
> echo $line; ts=""; fi; done' | wc -l
> 86.18user 19.46system 1:50.62elapsed 95%CPU (0avgtext+0avgdata
> 1176maxresident)k
> 0inputs+0outputs (0major+344minor)pagefaults 0swaps
> 500000
>
> It's still high, so I stare at it again, and notice that I inadvertently
> dropped the elif into just an if, meaning it's checking the first 4 chars
> of every timestamp line to see if it's an ORA line.. lets not do that...
>
> $ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
> 'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; elif
> [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo
> $line; ts=""; fi; done' | wc -l
> 72.03user 18.48system 1:35.68elapsed 94%CPU (0avgtext+0avgdata
> 1180maxresident)k
> 0inputs+0outputs (0major+345minor)pagefaults 0swaps
> 500000
>
> Okay, that'll have to do.  1m35s, but it's really light on ram!:-D
>
> Time to sleep...
>
>
> Re: [TriLUG] Regexp Help.eml
>
> Subject:
> Re: [TriLUG] Regexp Help
> From:
> William Sutton <william at trilug.org>
> Date:
> 8/13/2014 7:27 AM
>
> To:
> Triangle Linux Users Group General Discussion <trilug at trilug.org>
>
>
> I first tried the egrep idea last night before diving into Perl.  Part 
> of that may be my bias towards Perl :-)
>
> I came up with something in the neighborhood of cat |egrep, but ran 
> into the multiple timestamps that Aaron noted.  I didn't think to do a 
> secondary exclusionary grep.  It also occurred to me that I don't 
> necessarily know what the other log contents are, that the blank line 
> is, in fact, a separate blank line, and that with multiple records, it 
> would likely still be there.
>
> So, assuming the worst (that there might be some record constructs 
> that weren't in the sample), I rewrote it in Perl to capture /exactly/ 
> what was asked for and exclude anything else.
>
> That said, and depending on your level of comfort with junk entries or 
> extra spaces or whatever, Aaron's solution is definitely simpler and 
> has the advantage of not having to be a stand-alone script.
>
> William Sutton
>
> On Wed, 13 Aug 2014, Aaron Joyner wrote:
>
>> First off, William's solution is very clever and complete... but it 
>> strikes
>> me as vastly over-complicated.  Perhaps you both share some extra 
>> knowledge
>> about the contents of these logs that isn't evident in this post, but
>> wouldn't the following shell snippet also accomplish roughly the same 
>> goal?
>> $ egrep '2014$|ORA-' /tmp/text
>> Tue Mar 25 19:48:43 2014
>> Tue Mar 25 19:48:47 2014
>> Tue Mar 25 19:52:17 2014
>> ORA-12012: error on auto execute of job 216103
>> ORA-30926: unable to get a stable set of rows in the source tables
>> ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
>> ORA-06512: at line 2
>> Tue Mar 25 19:56:51 2014
>> Tue Mar 25 19:56:51 2014
>> Tue Mar 25 19:56:51 2014
>> $
>>
>> Yes, the simple 'egrep' solution doesn't suppress the timestamp lines 
>> which
>> have no corresponding ORA- lines.  It sounds like you're still 
>> planning on
>> eyeballing the output, so perhaps that's okay?  If not, you can do one
>> better with just a touch more egrep:
>>
>> $ egrep '2014$|^ORA-' /tmp/text | egrep -B1 ^ORA- | grep -v '^--$'
>> Tue Mar 25 19:52:17 2014
>> ORA-12012: error on auto execute of job 216103
>> ORA-30926: unable to get a stable set of rows in the source tables
>> ORA-06512: at ENTREF.BGNPOSTLOAD, line 156
>> ORA-06512: at line 2
>> $
>>
>> Sure, those regexes are a *lot* simpler than William's, but assuming 
>> that
>> the ^ORA- string is fairly unique, the second version should be pretty
>> unlikely to turn up a false match.  It's just grabbing the closest 
>> previous
>> line that looks like a timestamp.
>>
>> The moral of the story that I'm telling is that if you look carefully at
>> most datasets, and know a couple of the more esoteric options of 
>> egrep (eg.
>> multiple matching with |, -A, -B, -C), you can usually get what you need
>> without resorting to heroic complication.
>>
>> Aaron S. Joyner
>>
>>
>>
>> Warning: This post goes really off the rails from here into my own 
>> personal
>> late-night optimizations, tinkering, and a smidge of ranting. Proceed 
>> with
>> caution.
>>
>> PS - As an academic exercise, I briefly cooked a parser in shell 
>> similar to
>> William's solution (it basically hinged on two tricks: [ ${line:0:4} ==
>> ORA- ] and storing the previous line in $line), but it's sufficiently
>> inelegant that I don't feel posting it here would really advance the
>> discussion.  Curious readers might wish to give that approach a try.
>>
>> PPS - Aw, what the he.*.  I have a pet peeve with functional constructs
>> such as map and lambda, a reasonable percentage of people (usually, 
>> myself
>> included) don't find them intuitive.  Hence, my preference is to dump 
>> most
>> of the lines we don't need up front with egrep, then hold just two 
>> lines of
>> state in memory at a time:
>> $ egrep '2014$|ORA-' /tmp/text | while read line; do
>> if [ "${line:0:4}" == ORA- ]; then
>>  echo $last;
>> elif [ "${last:0:4}" == ORA- ]; then
>>  echo $last;
>> fi;
>> last="$line";
>> done
>>
>> PPPS - So... naturally, how about runtime and memory usage 
>> comparison? :-D
>> Note: shell is a horrible text parsing language for many reasons, speed
>> not least among them.  Along the way, we'll see that simple regexes help
>> dramatically, and that it's really hard to beat a well-tuned C 
>> program like
>> egrep.
>> $ for i in `seq 1 1000`; do cat /tmp/text >> /tmp/hugetext; done
>> $ for i in `seq 1 100`; do cat /tmp/hugetext >> /tmp/text.100k.times; 
>> done
>> $ ls -lh /tmp/text.100k.times
>> -rw-r--r-- 1 asjoyner asjoyner 72M Aug 13 00:06 /tmp/text.100k.times$ 
>> wc -l
>> text.100k.times
>> 1800000 text.100k.times
>> $ ln -s text.100k.times
>> bar
>>
>> $ /usr/bin/time egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time
>> egrep -B1 ^ORA- | /usr/bin/time grep -v '^--$' | wc -l
>> 0.44user 0.10system 0:01.05elapsed 52%CPU (0avgtext+0avgdata
>> 696maxresident)k
>> 0inputs+0outputs (0major+221minor)pagefaults 0swaps
>> 0.18user 0.05system 0:01.05elapsed 22%CPU (0avgtext+0avgdata
>> 716maxresident)k
>> 0inputs+0outputs (0major+226minor)pagefaults 0swaps
>> 0.15user 0.03system 0:01.06elapsed 17%CPU (0avgtext+0avgdata
>> 712maxresident)k
>> 0inputs+0outputs (0major+225minor)pagefaults 0swaps
>> 500000
>>
>> $ /usr/bin/time ./omg.pl | wc -l
>> 28.12user 0.71system 0:29.07elapsed 99%CPU (0avgtext+0avgdata
>> 310008maxresident)k
>> 0inputs+0outputs (0major+96785minor)pagefaults 0swaps
>> 500000
>>
>> $ /usr/bin/time egrep '2014$|ORA-' /tmp/text.100k.times | /usr/bin/time
>> /bin/bash -c '(while read line; do if [ "${line:0:4}" == ORA- ]; then
>> echo $last; elif [ "${last:0:4}" == ORA- ]; then   echo $last; fi;
>> last="$line"; done)' | wc -l
>> 0.60user 0.12system 1:35.42elapsed 0%CPU (0avgtext+0avgdata 
>> 692maxresident)k
>> 0inputs+0outputs (0major+220minor)pagefaults 0swaps
>> 72.87user 17.83system 1:35.62elapsed 94%CPU (0avgtext+0avgdata
>> 1068maxresident)k
>> 0inputs+0outputs (0major+539minor)pagefaults 0swaps
>> 500000
>>
>> The perl solution takes about 29 seconds, the grep-based solution takes
>> about 1 second, the shell solution takes about 1m40s.  The above runs 
>> are
>> just one sample of several, in various orders, to rule out disk 
>> caching and
>> the other randomness associated with timesharing systems.
>>
>> Note that the map()-based perl solution uses ~300MB of RAM to process a
>> 72MB log file, because it slurps the entire file into a single variable.
>> That's not perl's fault, persay, just an artifact of reading in the 
>> whole
>> file at once (and making a couple copies of it along the way), rather 
>> than
>> processing it line by line.  By comparison the egrep/grep-based solution
>> uses about 2MB of ram, and the combined egrep/bash solution weighs in at
>> just under 2MB.
>>
>> PPPPS - Why can't I stop doing this?? I really should sleep... but if 
>> you
>> transliterate my bash parser into perl, with the looser regexes, you get
>> something both light weight on ram, and still pretty fast, though 
>> honestly
>> not nearly as fast as I expected, it still takes 20 seconds:
>> $ /usr/bin/time perl -ne 'if (/^ORA-|2014$/) { if (/^ORA-/) { print 
>> $last;
>> } elsif ($last =~ /^ORA-/) { print $last; } $last = $_; }'
>> /tmp/text.100k.times | wc -l
>> 20.22user 0.10system 0:20.50elapsed 99%CPU (0avgtext+0avgdata
>> 1596maxresident)k
>> 0inputs+0outputs (0major+453minor)pagefaults 0swaps
>> 500000
>>
>> ... there are just too many repetitive regex matches... let's do 
>> something
>> more pearly if we're going to use perl...
>>
>> $ /usr/bin/time perl -ne 'if (/2014$/) { $ts = $_ } if (/^ORA-/) { if 
>> ($ts)
>> { print $ts } print; undef $ts; }' /tmp/text.100k.times | wc -l
>> 2.03user 0.08system 0:02.27elapsed 93%CPU (0avgtext+0avgdata
>> 1592maxresident)k
>> 0inputs+0outputs (0major+452minor)pagefaults 0swaps
>> 500000
>>
>> There we go!  ~1.5MB of ram and about 2 seconds... can we use the same
>> algorithm in bash?  how will it fare?
>>
>> $ cat /tmp/text.100k.times | /usr/bin/time /bin/bash -c 'while read 
>> line;
>> do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; fi; if [ 
>> "${line:0:4}"
>> == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; echo $line; ts="";
>> fi; done' | wc -l
>> 139.51user 38.58system 3:05.02elapsed 96%CPU (0avgtext+0avgdata
>> 2744maxresident)k
>> 0inputs+0outputs (0major+736minor)pagefaults 0swaps
>> 500000
>>
>> Oh my, what have I done?  Run time is up north of 3 minutes... oh, I
>> dropped the egrep, and now bash is hitting every line, and as discussed,
>> it's awful at .. well basically everything.  Let's slip the egrep 
>> back in
>> there and see how it fares.  It really should be more respectable than
>> that...
>>
>> $ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
>> 'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; 
>> fi;
>> if [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi;
>> echo $line; ts=""; fi; done' | wc -l
>> 86.18user 19.46system 1:50.62elapsed 95%CPU (0avgtext+0avgdata
>> 1176maxresident)k
>> 0inputs+0outputs (0major+344minor)pagefaults 0swaps
>> 500000
>>
>> It's still high, so I stare at it again, and notice that I inadvertently
>> dropped the elif into just an if, meaning it's checking the first 4 
>> chars
>> of every timestamp line to see if it's an ORA line.. lets not do that...
>>
>> $ egrep '2014$|^ORA-' /tmp/text.100k.times | /usr/bin/time /bin/bash -c
>> 'while read line; do if [ "${line: -4:4}" == 2014 ]; then ts="$line"; 
>> elif
>> [ "${line:0:4}" == ORA- ]; then if [ -n "$ts" ]; then echo "$ts"; fi; 
>> echo
>> $line; ts=""; fi; done' | wc -l
>> 72.03user 18.48system 1:35.68elapsed 94%CPU (0avgtext+0avgdata
>> 1180maxresident)k
>> 0inputs+0outputs (0major+345minor)pagefaults 0swaps
>> 500000
>>
>> Okay, that'll have to do.  1m35s, but it's really light on ram! :-D
>>
>> Time to sleep...
>



More information about the TriLUG mailing list