[TriLUG] simple regular expression to strip HTML?

Wed Feb 18 23:23:37 EST 2004

On Wed, 18 Feb 2004, Tanner Lovelace wrote:

> Jeremy Portzer said the following on 2/18/04 9:28 PM:
> 
> > Does anyone know of a quick-and-dirty regular expression that will strip
> > simple HTML tags?  I'm not looking for something that is necessarily
> > 100% safe/tested, but something reasonable that will work.  It needs to
> > use the regular C regexp set of calls, not Perl extensions.
> > 
> > For example:  "<em>Bold</em> type" should substitute to "Bold type"
> > 
> 
> Doing some experimentation, I see that Perl is greedy by default, but
> if you follow a quantifier with ? it turns that off.  So, this
> should remove all HTML tags from a file:
> 
> perl -pi -e 's/<.*?>//g' [filename]
> 
> I have tested this and it seems to work for me.  YMMV.

Unfortunately, the non-greedy operator (the question mark) is not 
supported by the standard C library regexp() call, which I'm using.  However, 
the following accomplishes something similar (my thanks to 'scalar' on IRC):
	s/<[^>]+>//g

This doesn't handle a > character that appears quoted inside an 
attribute value within an HTML tag (the match would end at that > 
instead of at the real end of the tag), but I don't need to worry about 
that for my simple application.
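
For anyone who wants to try the same pattern from C, here is a rough
sketch.  It assumes the POSIX regcomp()/regexec() interface from
<regex.h>, which may not be the exact routine used above, and the
function name strip_tags() is made up for illustration.  It copies the
input to a caller-supplied buffer, skipping every match of <[^>]+>.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `in` to `out`, dropping every match of the
 * POSIX extended regex <[^>]+> .  `out` must have room for at least
 * strlen(in) + 1 bytes.  Returns 0 on success, -1 if the pattern
 * fails to compile. */
int strip_tags(const char *in, char *out)
{
    regex_t re;
    regmatch_t m;

    assert(in != NULL && out != NULL);

    if (regcomp(&re, "<[^>]+>", REG_EXTENDED) != 0)
        return -1;

    /* Each match is at least three bytes long, so `in` always advances. */
    while (regexec(&re, in, 1, &m, 0) == 0) {
        /* Keep the text before the match, then skip the tag itself. */
        memcpy(out, in, (size_t)m.rm_so);
        out += m.rm_so;
        in += m.rm_eo;
    }
    strcpy(out, in);    /* whatever follows the last tag */

    regfree(&re);
    return 0;
}
```

Called on "<em>Bold</em> type", this leaves "Bold type" in the output
buffer.  The limitation mentioned above shows up as expected: for a tag
with a quoted > in an attribute, the text after that > is left behind.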

Thanks for the help everyone (both here and on IRC).

--Jeremy

-- 
/---------------------------------------------------------------------\
| Jeremy Portzer        jeremyp at pobox.com      trilug.org/~jeremy     |
| GPG Fingerprint: 712D 77C7 AB2D 2130 989F  E135 6F9F F7BC CC1A 7B92 |
\---------------------------------------------------------------------/