
How can I strip HTML in a string using Perl?

Is there an easier way than this to strip HTML from a string using Perl?

$Error_Msg =~ s|<b>||ig;
$Error_Msg =~ s|</b>||ig;
$Error_Msg =~ s|<h1>||ig;
$Error_Msg =~ s|</h1>||ig;
$Error_Msg =~ s|<br>||ig;

I would appreciate a slimmed-down regular expression, e.g. something like this:

$Error_Msg =~ s|</?[b|h1|br]>||ig;

Also, is there an existing Perl function that strips any/all HTML from a string, even though I only need b, h1, and br tags stripped?

ParoX asked Jul 01 '09 05:07


2 Answers

Assuming the input is valid HTML (no stray < or > characters), this removes every tag:

$htmlCode =~ s|<.+?>||gs;   # /s lets . match newlines, so tags split across lines are removed too

If you need to remove only b, h1, and br tags:

$htmlCode =~ s#</?(?:b|h1|br)\b.*?>##gis;   # /i matches <B> and <BR> as well; /s allows tags that span lines
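The bracketed form in the question, [b|h1|br], is a character class: it matches exactly one character out of b, |, h, 1, and r, so it never matches a two-letter tag name such as h1 or br. A grouped alternation matches whole names instead. The following is a minimal sketch of the difference; the sample message and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input, not taken from the question.
my $msg = '<h1>Fatal</h1><b>disk</b> full<br><i>retry</i>';

# Character class: only single-character names from the set b | h 1 r match,
# so <b> and </b> are removed while <h1>, </h1> and <br> are left behind.
(my $with_class = $msg) =~ s{</?[b|h1|br]>}{}gi;

# Alternation: whole tag names b, h1 or br, followed by a word boundary so
# that longer names such as <body> are not matched by accident.
(my $with_alt = $msg) =~ s{</?(?:b|h1|br)\b.*?>}{}gis;

print "$with_class\n";   # <h1>Fatal</h1>disk full<br><i>retry</i>
print "$with_alt\n";     # Fataldisk full<i>retry</i>
```

The <i> element survives in both cases because it is not in the list of names to strip.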

You might also want to consider the HTML::Strip module from CPAN.

Abhinav Gupta answered Oct 06 '22 02:10


From perlfaq9: How do I remove HTML from a string?


The most correct way (albeit not the fastest) is to use HTML::Parser from CPAN. Another mostly correct way is to use HTML::FormatText which not only removes HTML but also attempts to do a little simple formatting of the resulting plain text.
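As a rough illustration of the HTML::Parser route, the sketch below collects only the text nodes of a document. It assumes HTML::Parser is installed from CPAN (it is not a core module); strip_html is a hypothetical helper name, not something perlfaq9 or the module provides.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text content of an HTML string.
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each chunk of text; 'dtext' delivers the text with
        # entities such as &lt; already decoded.
        text_h => [ sub { $text .= shift }, 'dtext' ],
    );

    # Drop script and style elements together with their contents.
    $parser->ignore_elements(qw(script style));

    $parser->parse($html);
    $parser->eof;    # signal end of input so buffered text is flushed

    return $text;
}

print strip_html(q{<h1>Error</h1><b>a &lt; b</b><br>}), "\n";
```

Because a real parser tracks quoting and element boundaries, a tag such as <IMG SRC = "foo.gif" ALT = "A > B"> is removed as a whole rather than cut at the quoted >.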

Many folks attempt a simple-minded regular expression approach, like s/<.*?>//g, but that fails in many cases because the tags may continue over line breaks, they may contain quoted angle-brackets, or HTML comments may be present. Plus, folks forget to convert entities, like &lt; for example.

Here's one "simple-minded" approach, that works for most files:

#!/usr/bin/perl -p0777
s/<(?:[^>'"]*|(['"]).*?\1)*>//gs
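To see what the quote-aware alternative buys, the sketch below applies both the naive pattern and the pattern above to a tag whose quoted attribute contains a > and spans two lines. The sample string and variable names are illustrative only.

```perl
use strict;
use warnings;

# Hypothetical input: a multi-line tag with > inside a quoted attribute.
my $html = qq{<IMG SRC = "foo.gif"\n ALT = "A > B">caption};

# Naive pattern: the match ends at the first >, which here sits inside
# the quoted ALT value, so part of the tag is left in the output.
(my $naive = $html) =~ s/<.*?>//gs;

# Quote-aware pattern: inside a tag, a quoted string is consumed as one
# unit, so a > between matching quotes does not end the tag.
(my $quoted = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "naive:  [$naive]\n";    # naive:  [ B">caption]
print "quoted: [$quoted]\n";   # quoted: [caption]
```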

If you want a more complete solution, see the 3-stage striphtml program in http://www.cpan.org/authors/id/T/TO/TOMC/scripts/striphtml.gz .

Here are some tricky cases that you should think about when picking a solution:

<IMG SRC = "foo.gif" ALT = "A > B">

<IMG SRC = "foo.gif"
 ALT = "A > B">

<!-- <A comment> -->

<script>if (a<b && a>c)</script>

<# Just data #>

<![INCLUDE CDATA [ >>>>>>>>>>>> ]]>

If HTML comments include other tags, those solutions would also break on text like this:

<!-- This section commented out.
    <B>You can't see me!</B>
-->
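
One possible mitigation, which is not part of perlfaq9, is to delete comments in a first pass and tags in a second. The sketch below assumes comments have the plain <!-- ... --> form and that the sequence <!-- never appears inside a script or a quoted attribute; for input where that may not hold, a real parser is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical input with markup hidden inside a comment.
my $html = <<'HTML';
<p>Shown</p>
<!-- This section commented out.
    <B>You can't see me!</B>
-->
HTML

# Tags only: the comment is cut at the first > it contains, so the
# commented-out text leaks into the result.
(my $tags_only = $html) =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

# Two passes: remove whole comments first, then the remaining tags.
(my $two_pass = $html) =~ s/<!--.*?-->//gs;
$two_pass =~ s/<(?:[^>'"]*|(['"]).*?\1)*>//gs;

print "tags only:\n$tags_only";
print "two pass:\n$two_pass";
```

With the two-pass version the commented-out sentence no longer appears in the output.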

brian d foy answered Oct 06 '22 01:10