
Clean up &amp; style characters from text

I am getting text from a feed that has a lot of characters like:

Insignia™ 2.0 Stereo Computer Speaker System (2-Piece) - Black
4th-Generation Apple® iPod® touch

Is there an easy way to get rid of these, or do I have to anticipate which characters I want to delete and use the delete method to remove them? Also, when I try to remove

&amp;

with

str.delete("&")

it leaves behind "amp;". Is there a better way to delete this type of character? Do I need to re-encode the text?

Jeremy Smith asked Oct 18 '11 14:10


2 Answers

String#delete is certainly not what you want, as it works on characters, not the string as a whole.

Try

str.gsub(/&amp;/, "")
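
To make the difference concrete, here is a small sketch. The sample string and variable names are made up for illustration; only String#delete and String#gsub come from the answer.

```ruby
# Hypothetical sample text containing an escaped ampersand.
text = "Tom &amp; Jerry"

# String#delete treats its argument as a set of characters to strip.
only_ampersand = text.delete("&")        # strips just the "&" character
char_set       = text.delete("&amp;")    # strips every "&", "a", "m", "p" and ";"

# String#gsub matches the entity as one unit.
whole_entity   = text.gsub("&amp;", "")

puts only_ampersand  # Tom amp; Jerry
puts char_set        # To  Jerry
puts whole_entity    # Tom  Jerry
```

Note that delete("&amp;") also eats the "m" in "Tom", which is why it is the wrong tool here.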

You may also want to try replacing the &amp; with a literal ampersand, such as:

str.gsub(/&amp;/, "&")

If this is closer to what you really want, you may get the best results by unescaping the HTML string. If so, try this:

require 'cgi'
CGI.unescapeHTML(str)

Details of the unescapeHTML method are in the Ruby documentation for the CGI library.
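
As a rough sketch of what that looks like on feed-style text: the input string below is an assumption (the question does not show which entities the feed actually contains), so adjust it to your data. As far as I know, CGI.unescapeHTML handles numeric entities plus a handful of named ones such as &amp;, &lt;, &gt; and &quot;, not the full HTML entity list.

```ruby
require 'cgi'

# Hypothetical raw feed text with a numeric entity and an escaped ampersand.
raw = "Insignia&#8482; 2.0 Speakers &amp; Cables"

# Convert the entities back into the characters they stand for.
clean = CGI.unescapeHTML(raw)

puts clean  # Insignia™ 2.0 Speakers & Cables
```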

Mark Thomas answered Sep 26 '22 10:09


If you are getting data from a 'feed', aka RSS XML, then you should be using an XML parser like Nokogiri to process the XML. This will automatically unescape HTML entities and allow you to get the proper string representation directly.

Phrogz answered Sep 22 '22 10:09