[TriLUG] A curious regular expression

Tom Bryan tbryan at python.net
Tue Apr 10 13:23:18 EDT 2007


William Sutton wrote:

>> Huh?  I missed the first part of the thread, but which REGEX language are you 
>> talking about?  If we're talking Perlish regex, don't the brackets make it a 
>> character class?  That is, the {,99} doesn't indicate a quantifier, it just 
>> adds the characters '{', ',', '9', and '}' to the character class.

> {x,y} in perl regex is in fact a quantifier; empirical tests show that 
> when x isn't specified, it is treated as '1'; thus, we're talking about 
> 1-99 (inclusive) sequential '@' characters.

Sure, but *not* inside a character class.  That was the entire point of 
my previous message.  I'm not a master of regular expressions, but 
character classes don't work the way you're saying they do.  :)

>                '+@#&%',    # leading .*, has @*

As I said before, the + isn't part of the match.  It's a red herring. 
Just because 'a\wc' matches 'abc' and 'azc' and 'xyzabc' does not mean 
that the 'xyz' in the last example is part of the match.  Those are just 
extra characters that the regex engine scans past on its way to the 
matching string. :)
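
A minimal sketch of that point, using the same 'a\wc' example.  The 
helper name match_info is my own invention for illustration; $& and 
$-[0] are the standard Perl variables for the matched text and the 
offset where the match starts.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text that actually matched and the
# offset at which the match starts, or an empty list if nothing matched.
sub match_info {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    my ($text, $start) = ($&, $-[0]);
    return ($text, $start);
}

foreach my $string ('abc', 'azc', 'xyzabc') {
    my ($text, $start) = match_info($string, qr/a\wc/);
    print "'$string' matched '$text' at offset $start\n";
}
```

For 'xyzabc' this reports the match 'abc' at offset 3, so the leading 
'xyz' is skipped rather than matched.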

> Now then, the regex (in perl, dunno what regex language was originally 
> being used) is as follows:
> 
> [		# character class
> 	.*	# 0 or more characters
> 	\@{,99}	# and 1-99 '@' characters
> ]		# end character class
> #&%		# followed by '#&%'

And I would disagree entirely.  Once you're in the character class, the 
., *, and { are just characters.  They have no special meaning.
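
Here is a rough way to check that claim one character at a time.  The 
helper in_class is a name I made up for this sketch; it anchors the 
class with \A and \z so that exactly one character is tested against 
it.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the single character $char is a member
# of the bracketed class from the regex under discussion.
sub in_class {
    my ($char) = @_;
    return $char =~ /\A[.*\@{,99}]\z/ ? 1 : 0;
}

# The last entry, '\\', is a single backslash character.
foreach my $char ('.', '*', '@', '{', ',', '9', '}', '+', 'a', '\\') {
    printf "%-2s %s\n", $char,
        in_class($char) ? 'in the class' : 'not in the class';
}
```

The first seven characters are reported as members of the class.  The 
'+', the 'a', and the backslash are not: the backslash in the class 
only escapes the @, and a '.' inside brackets is a literal dot, not 
"any character".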

> in other words, you must have at least a single '@' somewhere in the 
> character class; 

No, you don't.  See below.

> before or after .* doesn't matter; 

No.  Here, the .* are just other options.  It can be a . or * or @.  I 
can put something before the @, but that's irrelevant.  It's *not* part 
of the string that matches the regular expression.

> can have 0 or more
> characters (of unspecified value) either before or after the '@' 
> character(s), and the string has to also contain '#&%' following the 
> character class.

I'll say it again.  You must have one of
. * @ { , 9 }
followed by
#&%


Here's a modified version of your program.  Let's use the $& special 
variable here to show what part of each matching string actually matched 
the regular expression.  That might clear some things up.  I also 
include some counterexamples that show that I can match without an @ in 
the string.  I also show that when I match a longer string (0 or more 
characters plus 1-99 @'s), the characters preceding the @#&% at the end 
of the string are not actually part of the match.

####
my @strings = (
                '#&%',      # no leading .*, no @*
                '@#&%',     # no leading.*, has @*
                '+#&%',     # leading .*, no @*
                '+@#&%',    # leading .*, has @*
                'not_part_of_the_match@@@@@#&%',  # matches last 4 chars
                ',#&%',     # also a valid match, no @'s
                '9#&%',     # also a valid match, no @'s
                '{#&%',     # also a valid match, no @'s
                '}#&%',     # also a valid match, no @'s
                '*#&%'      # also a valid match, no @'s
               );

foreach my $string (@strings)
{
     print "string $string ";
     if ($string =~ m/[.*\@{,99}]#&%/)
     {
         print "matches this text '$&'.\n";
     }
     else
     {
         print "does not match.\n";
     }
}
####

string #&% does not match.
string @#&% matches this text '@#&%'.
string +#&% does not match.
string +@#&% matches this text '@#&%'.
string not_part_of_the_match@@@@@#&% matches this text '@#&%'.
string ,#&% matches this text ',#&%'.
string 9#&% matches this text '9#&%'.
string {#&% matches this text '{#&%'.
string }#&% matches this text '}#&%'.
string *#&% matches this text '*#&%'.
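
For contrast, here is a sketch of what I understand the description 
above to mean: the same pieces written *without* the brackets.  This 
pattern is my assumption about the intended regex, not something from 
the original thread, and I write the lower bound explicitly as {1,99} 
so the sketch doesn't depend on how a given Perl treats an omitted 
minimum.  The helper name matched_text is also made up.

```perl
use strict;
use warnings;

# Assumption: the intended pattern, with no character class and an
# explicit lower bound of 1 on the '@' quantifier.
my $unbracketed = qr/.*\@{1,99}#&%/;

# Hypothetical helper: return the text that matched, or undef.
sub matched_text {
    my ($string, $regex) = @_;
    return unless $string =~ $regex;
    my $text = $&;
    return $text;
}

foreach my $string ('+@#&%', ',#&%', 'not_part_of_the_match@@@@@#&%') {
    my $text = matched_text($string, $unbracketed);
    print "string $string ",
        defined $text ? "matches this text '$text'.\n" : "does not match.\n";
}
```

With this pattern the leading characters really are part of the match 
(the first and third strings match in full), and ',#&%' no longer 
matches because it contains no @.  That is the opposite of what the 
bracketed version does.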

Regards,
---Tom


