I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\??(?!p).+?>
But this still matches the closing </p> tag. Any hint on how to make it skip the closing tag as well?
Note, this is being performed on XHTML.
If you insist on using a regex, something like this will work in most cases:
    # Remove all HTML except "p" tags
    $html =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;
Explanation:
    s{
      <             # opening angled bracket
      (?>/?)        # ratchet past optional /
      (?:
        [^pP]       # non-p tag
        |           # ...or...
        [pP][^\s>/] # longer tag that begins with p (e.g., <pre>)
      )
      [^>]*         # everything until closing angled bracket
      >             # closing angled bracket
    }{}gx;          # replace with nothing, globally
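To see what the pattern keeps and what it drops, here is a minimal sketch that runs it over a made-up sample string (the input and the variable names are illustrative assumptions, not from the question):

```perl
use strict;
use warnings;

# Hypothetical sample input, chosen to exercise each branch of the pattern.
my $html = '<div><p class="intro">Hello <b>world</b></p><pre>code</pre><P>Bye</P></div>';

# Work on a copy so the original string stays available for comparison.
(my $stripped = $html) =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

print "$stripped\n";
# prints: <p class="intro">Hello world</p>code<P>Bye</P>
```

The `<p ...>`, `</p>`, `<P>` and `</P>` tags survive because the character after the `p` is whitespace or `>`. `<pre>` is removed because the `p` is followed by another letter.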
But really, save yourself some headaches and use a parser instead. CPAN has several modules that are suitable. Here's an example using the HTML::TokeParser module that comes with the extremely capable HTML::Parser CPAN distribution:
    use strict;

    use HTML::TokeParser;

    my $parser = HTML::TokeParser->new('/some/file.html')
      or die "Could not open /some/file.html - $!";

    while(my $t = $parser->get_token)
    {
      # Skip start or end tags that are not "p" tags
      next  if(($t->[0] eq 'S' || $t->[0] eq 'E') && lc $t->[1] ne 'p');

      # Print everything else normally (see HTML::TokeParser docs for explanation)
      if($t->[0] eq 'T')
      {
        print $t->[1];
      }
      else
      {
        print $t->[-1];
      }
    }
HTML::Parser accepts input in the form of a file name, an open file handle, or a string. Wrapping the above code in a library and making the destination configurable (i.e., not just printing as in the above) is not hard. The result will be much more reliable, maintainable, and possibly also faster (HTML::Parser uses a C-based backend) than trying to use regular expressions.
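As a rough sketch of that wrapping, the loop could be moved into a function that hands each chunk of output to a callback. The name `keep_only_p` and the callback interface are assumptions made for illustration, not part of HTML::Parser. Note that HTML::TokeParser takes string input as a reference to the string.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# keep_only_p() is a hypothetical helper name, not part of HTML::Parser.
# $source: a file name, an open file handle, or a reference to a string.
# $emit:   a code ref that receives each chunk of output.
sub keep_only_p {
    my ($source, $emit) = @_;
    my $parser = HTML::TokeParser->new($source)
        or die "Could not open input - $!";

    while (my $t = $parser->get_token) {
        my $type = $t->[0];

        # Skip start or end tags that are not "p" tags
        next if ($type eq 'S' || $type eq 'E') && lc $t->[1] ne 'p';

        # Text tokens keep their text in slot 1; every other token type
        # keeps its original source text in the last slot.
        $emit->($type eq 'T' ? $t->[1] : $t->[-1]);
    }
    return;
}

# Usage: collect the result in a string instead of printing it.
my $html = '<div><p class="intro">Hello <b>world</b></p></div>';
my $out  = '';
keep_only_p(\$html, sub { $out .= $_[0] });
print "$out\n";
```

With this sample input, `$out` should end up as `<p class="intro">Hello world</p>`. Passing `sub { print $_[0] }` as the callback reproduces the behaviour of the original loop.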
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML, a much simpler language, was created).
For example, this:
    <HTML /
      <HEAD /
        <TITLE / > /
        <P / >
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
    <html>
      <head>
        <title>
          >
        </title>
      </head>
      <body>
        <p>
          >
        </p>
      </body>
    </html>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.
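Even plain XHTML has cases that trip up a regex. As a small illustration, a literal `>` inside a quoted attribute value is legal, but a pattern that scans with `[^>]*>` (such as the one quoted earlier in this thread) treats it as the end of the tag. The sample string below is made up for the demonstration:

```perl
use strict;
use warnings;

# Hypothetical input: the title attribute contains a literal ">".
my $html = '<p>See <a title="a > b" href="#">this link</a></p>';

# Same substitution as in the regex answer above, applied to a copy.
(my $stripped = $html) =~ s{<(?>/?)(?:[^pP]|[pP][^\s>/])[^>]*>}{}g;

print "$stripped\n";
# prints: <p>See  b" href="#">this link</p>
```

The match for the `<a ...>` tag stops at the `>` inside the attribute, so the rest of the tag is left behind as text. A tokenizing parser tracks quoting and does not have this problem.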