[TriLUG] regex: match lines NOT containing X

Aaron Joyner aaron at joyner.ws
Thu Feb 23 13:50:21 EST 2012


I got pulled away from this and forgot to hit send yesterday
afternoon... but better late than never, I suppose.  :)

Just a style point... when I see people use regexes with negative
lookaheads, or other features that require me to pick up the copy of
Mastering Regular Expressions off my bookshelf, I invariably think
"you're doing it wrong".  I realize you probably have a very special
case, and in those cases I'm always very glad those features exist in
most regex parsers... but lest someone else get the idea that these
features are *awesome* for daily use, it's worth pointing out that for
most use cases, they're very inefficient.  To boot, regexes of that
style can be very difficult for the 'next guy' to unpack.  Never
forget, the 'next guy' is usually you, just 6 months from now.

Let's take an example with a slightly larger text corpus, so we can
get some idea of how the calculation time stacks up.  For example, the
full text of the US Constitution, lines of text which contain the
string 'People', but do not contain the string 'Union'.

asjoyner:/tmp$ curl -s http://www.usconstitution.net/const.txt > const.txt
asjoyner:/tmp$ wc const.txt
  872  7652 45119 const.txt
asjoyner:/tmp$ time (grep -P 'People' const.txt | grep -Pv Union)
Year by the People of the several States, and the Electors in each State shall

real	0m0.002s
user	0m0.000s
sys	0m0.000s
asjoyner:/tmp$ time grep -P '^(.*People)(.(?!Union))*$' const.txt
Year by the People of the several States, and the Electors in each State shall

real	0m0.004s
user	0m0.000s
sys	0m0.000s

That's a somewhat short synthetic example, but even on 872 lines of
text it's half as fast.  Most of that time is startup overhead.  Let's
make the sample a bit bigger to tease out the regex engine comparison.

asjoyner:/tmp$ for i in `seq 1 10000`; do cat const.txt >> 10000consts.txt; done
asjoyner:/tmp$ wc 10000consts.txt
  8720000  76520000 451190000 10000consts.txt
asjoyner:/tmp$ time (grep -P 'People' 10000consts.txt | grep -Pv Union) | wc -l
10000

real	0m0.409s
user	0m0.400s
sys	0m0.030s
asjoyner:/tmp$ time grep -P '^(.*People)(.(?!Union))*$' 10000consts.txt  | wc -l
10000

real	0m7.079s
user	0m7.000s
sys	0m0.060s

So, now that we've gotten the startup overhead to be a meaninglessly
small fraction, we can see that the negative lookahead version, even
when using the PCRE engine for the simpler regex, is more than 17
times faster.  It's also going to be marginally more memory efficient,
particularly for longer lines of input, etc, etc.

Ultimately, the most important thing to consider, is when you have to
unpack what's going on in 6 months, which of these two conceptual
operations will be easier for you to interpret:
grep two | grep -v two.*heads
grep -P '^(.*People)(.(?!Union))*$'

Aaron S. Joyner


On Wed, Feb 22, 2012 at 11:19 AM, Kevin Hunter <hunteke at earlham.edu> wrote:
> At 11:11am -0500 Wed, 22 Feb 2012, Justis Peters wrote:
>>
>> On 02/22/2012 09:56 AM, Kevin Hunter wrote:
>>>
>>> -----
>>> If two heads are better than one,
>>> are two keyboards better than one?
>>> Would you say the same if it were two hydra heads?
>>> -----
>>>
>>> My criteria is very specific: I want lines that do *not* contain
>>> 'heads'.
>>
>>
>> If the regex library you are using has "negative lookahead", you can do
>> this:
>> /^((?!heads).)*$/;
>>
>> Thanks to http://stackoverflow.com/a/406408/795280 for this answer.
>
>
> Oh oh!  I /knew/ someone would have solved this before.  Alas.  Your Google
> is better than my Google.  Thank you!  And you beat me to the list.  You get
> the glory.  :-)
>
> Thanks for the pointer.  Much better explained that I just did, obviously.
>
> Cheers,
>
> Kevin
>
> --
> This message was sent to: Aaron S. Joyner <aaron at joyner.ws>
> To unsubscribe, send a blank message to trilug-leave at trilug.org from that
> address.
> TriLUG mailing list : http://www.trilug.org/mailman/listinfo/trilug
> Unsubscribe or edit options on the web  :
> http://www.trilug.org/mailman/options/trilug/aaron%40joyner.ws
> TriLUG FAQ          : http://www.trilug.org/wiki/Frequently_Asked_Questions



More information about the TriLUG mailing list