
Regex to match all HTML tags except <p> and </p>

I need to match and remove all tags using a regular expression in Perl. I have the following:

<\\??(?!p).+?> 

But this still matches the closing </p> tag. Any hint on how to exclude the closing tag as well?

Note that this is being performed on XHTML.

Xetius asked Aug 27 '08 10:08


2 Answers

If you insist on using a regex, something like this will work in most cases:

    # Remove all HTML except "p" tags
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

Explanation:

    s{
      <             # opening angled bracket
      (?>/?)        # ratchet past optional /
      (?:
        [^pP]       # non-p tag
        |           # ...or...
        [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
      )
      [^>]*         # everything until closing angled bracket
      >             # closing angled bracket
    }{}gx; # replace with nothing, globally
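As a quick sanity check, here is a minimal sketch that runs the substitution on a few sample strings. The sample markup and the helper name strip_non_p are made up for illustration; neither comes from the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: applies the regex above to a copy of the input.
sub strip_non_p {
    my ($html) = @_;
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
    return $html;
}

# Made-up samples: nested tags, a longer tag starting with "p",
# a "p" tag with attributes, and self-closing tags.
my @samples = (
    '<div><p>Hello <b>world</b></p></div>',
    '<pre>code</pre><p class="x">text</p>',
    '<p/><br/>',
);

print strip_non_p($_), "\n" for @samples;
```

The atomic group (?>/?) is what protects </p>: once the slash has been consumed, the engine may not give it back and retry with the slash standing in for a non-p tag name.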

But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:

    use strict;

    use HTML::TokeParser;

    my $parser = HTML::TokeParser->new('/some/file.html')
      or die "Could not open /some/file.html - $!";

    while(my $t = $parser->get_token)
    {
      # Skip start or end tags that are not "p" tags
      next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

      # Print everything else normally (see HTML::TokeParser docs for explanation)
      if($t->[0] eq 'T')
      {
        print $t->[1];
      }
      else
      {
        print $t->[-1];
      }
    }

The HTML::TokeParser constructor accepts a file name, an open file handle, or a reference to a string holding the document. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, more maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
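A minimal sketch of such a wrapper, assuming the HTML::Parser distribution is installed from CPAN. The function name keep_only_p and its callback parameter are hypothetical, chosen only for illustration. The document is passed as a reference to a string, and each surviving chunk goes to a caller-supplied code reference instead of being printed.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical wrapper: drops every tag except <p> and </p> from $html
# and passes each surviving chunk to the $emit callback.
sub keep_only_p {
    my ($html, $emit) = @_;

    # A reference to a scalar makes HTML::TokeParser parse the string itself.
    my $parser = HTML::TokeParser->new(\$html)
        or die "Could not create parser: $!";

    while (my $t = $parser->get_token) {
        my $type = $t->[0];

        # Skip start or end tags that are not "p" tags.
        next if ($type eq 'S' || $type eq 'E') && lc $t->[1] ne 'p';

        # Text tokens keep their text in slot 1; all others keep it last.
        $emit->($type eq 'T' ? $t->[1] : $t->[-1]);
    }
    return;
}

# Usage: collect the output in a string instead of printing it directly.
my $out = '';
keep_only_p('<div><p>Hello <b>world</b></p></div>', sub { $out .= $_[0] });
print "$out\n";
```

Taking a callback keeps the destination open: the same function can append to a string, write to a file handle, or feed another filter.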

John Siracusa answered Sep 23 '22 18:09


In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons the much simpler XHTML was created).

For example, this:

    <HTML /
      <HEAD /
        <TITLE / > /
        <P / >

is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)

It is semantically equivalent to

    <html>
      <head>
        <title>
          &gt;
        </title>
      </head>
      <body>
        <p>
          &gt;
        </p>
      </body>
    </html>

Odd as it looks, the first form is nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others have already suggested, using an actual HTML parser is just so much easier.
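To make the pain concrete, here is a small sketch of two inputs that trip up the regex from the first answer, even though both are ordinary markup. The sample strings and the helper name strip_non_p are made up for illustration.

```perl
use strict;
use warnings;

# The substitution from the first answer, applied to a copy of the input.
sub strip_non_p {
    my ($html) = @_;
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
    return $html;
}

# A ">" inside an attribute value ends the match too early,
# so the tail of the tag is left behind in the output.
print strip_non_p('<img alt="1 > 0"/><p>ok</p>'), "\n";

# A commented-out <p> is treated as live markup: the match swallows the
# comment opener plus the <p> and stops at the first ">", leaving the
# rest of the comment and an unbalanced </p> behind.
print strip_non_p('<!-- <p>old</p> --><p>new</p>'), "\n";
```

A tokenizing parser handles both cases because it knows when it is inside an attribute value or a comment, which a single pattern anchored on angle brackets cannot track.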

Jörg W Mittag answered Sep 21 '22 18:09