Problem With Regular Expression to Remove HTML Tags

Question

In my Ruby app, I've used the following method and regular expression to remove all HTML tags from a string:

str.gsub(/<\/?[^>]*>/,"")

This regular expression did just about all I was expecting it to, except it caused all quotation marks to be transformed into &#8220; and all single quotes to be changed to &#8221;.

What's the obvious thing I'm missing to convert the messy codes back into their proper characters?

Edit: The problem occurs with or without the regular expression, so it clearly has nothing to do with it. My question now is how to correct this formatting error. Thanks!

vladr · Accepted Answer

Use CGI::unescapeHTML after you perform your regular expression substitution:

require 'cgi'
CGI::unescapeHTML(str.gsub(/<\/?[^>]*>/,""))

See http://www.ruby-doc.org/core/classes/CGI.html#M000547

In the above code snippet, gsub removes all HTML tags. Then unescapeHTML() converts all HTML entities (such as &lt; and &#8220;) back to their actual characters (<, quotes, etc.).

With respect to another post on this page, note that you will never ever be passed HTML such as

<tag attribute="<value>">2 + 3 < 6</tag>

(which is invalid HTML); what you may receive is, instead:

<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>

The call to gsub will transform the above to:

2 + 3 &lt; 6

And unescapeHTML will finish the job:

2 + 3 < 6
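As a minimal sketch of the two steps above in one place: the helper name strip_tags_and_unescape is made up for illustration and is not part of CGI or of the original answer, and the second input is an assumed example of the numeric entities mentioned in the question.

```ruby
require "cgi"

# Hypothetical helper (name is an assumption, not a library method):
# first strip the tags, then decode whatever HTML entities remain.
def strip_tags_and_unescape(html)
  without_tags = html.gsub(/<\/?[^>]*>/, "")
  CGI.unescapeHTML(without_tags)
end

# The example from the answer above.
puts strip_tags_and_unescape('<tag attribute="&lt;value&gt;">2 + 3 &lt; 6</tag>')

# Assumed input using the numeric quote entities from the question.
puts strip_tags_and_unescape("<p>&#8220;quoted&#8221;</p>")
```

The order matters here: unescaping first would turn &lt; back into < before the tag-stripping pass, and the regular expression could then treat ordinary text as a tag.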

Sniggerfardimungus · Answer

You're going to run into more trouble when you see something like:

<doohickey name="<foobar>">

You'll want to apply something like:

gsub(/<[^<>]*>/, "")

...for as long as the pattern matches.
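One way to read "for as long as the pattern matches" is a loop that repeats the substitution until nothing is left to replace. The sketch below assumes that reading; the method name strip_tags_repeatedly and the sample input are illustrative only.

```ruby
# Matches only "innermost" tags: a < ... > pair with no angle bracket inside.
INNERMOST_TAG = /<[^<>]*>/

# Hypothetical helper (name is an assumption): keep removing innermost
# tags until a pass makes no substitution.
def strip_tags_repeatedly(str)
  result = str.dup
  # gsub! returns nil when it made no replacement, which ends the loop.
  loop { break unless result.gsub!(INNERMOST_TAG, "") }
  result
end

input = '<doohickey name="<foobar>">text'

# A single pass with the original pattern stops at the first ">" and
# leaves part of the attribute behind.
puts input.gsub(/<\/?[^>]*>/, "")

# Repeated passes peel the nested brackets from the inside out.
puts strip_tags_repeatedly(input)
```

The first pass removes <foobar>, the second removes the now-empty <doohickey name="">, and the third finds nothing and stops.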

Tags: string, regex, ruby, encoding, btw
