Is there a better HTML escaping and unescaping tool than CGI for Ruby?

CGI.escapeHTML is pretty bad, but CGI.unescapeHTML is completely borked. For example:

require 'cgi'

CGI.unescapeHTML('&#8230;')
# => "…"                    # correct - an ellipsis

CGI.unescapeHTML('&hellip;')
# => "&hellip;"             # should be "…"

CGI.unescapeHTML('&#162;')
# => "\242"                 # correct - a cent

CGI.unescapeHTML('&cent;')
# => "&cent;"               # should be "\242"

CGI.escapeHTML("…")
# => "…"                    # should be "&hellip;"

It appears that unescapeHTML knows about all of the numeric codes plus &amp;, &lt;, &gt;, and &quot;. escapeHTML knows only those last four; it doesn't produce any of the numeric codes. I understand that escaping doesn't generally need to be as robust, since HTML allows the literal versions of most characters other than the four that CGI.escapeHTML handles. But unescaping really should be better.
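To make the gap concrete, here is a minimal sketch of a decoder that handles numeric references and named entities in the same pass. The helper name `decode_entities` and the `NAMED_ENTITIES` table are hypothetical (they are not part of CGI or any gem), the table holds only a handful of entries for illustration, and the code assumes Ruby 1.9 or later with UTF-8 strings.

```ruby
# Hypothetical lookup table, for illustration only.
# A real decoder would need the full list of HTML named entities.
NAMED_ENTITIES = {
  'amp'    => '&',
  'lt'     => '<',
  'gt'     => '>',
  'quot'   => '"',
  'hellip' => "\u2026", # horizontal ellipsis
  'cent'   => "\u00A2"  # cent sign
}.freeze

# Hypothetical helper: decodes decimal (&#8230;), hex (&#x2026;) and
# the named entities listed above. Unknown names are left untouched.
# No validation is done for out-of-range code points.
def decode_entities(text)
  text.gsub(/&(?:#(\d+)|#[xX](\h+)|(\w+));/) do |match|
    if $1
      [$1.to_i].pack('U')   # decimal code point to UTF-8
    elsif $2
      [$2.hex].pack('U')    # hex code point to UTF-8
    else
      NAMED_ENTITIES.fetch($3, match)
    end
  end
end
```

Because the substitution is a single pass, an input such as `&amp;hellip;` decodes to the text `&hellip;` rather than being decoded twice.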

Is there a better tool out there, at least for unescaping?

James A. Rosen asked Dec 18 '08 19:12



2 Answers

The htmlentities gem should do the trick:

require 'rubygems'
require 'htmlentities'

coder = HTMLEntities.new

coder.decode('&#8230;') # => "…"
coder.decode('&hellip;') # => "…"
coder.decode('&#162;') # => "¢"
coder.decode('&cent;') # => "¢"
coder.encode("…", :named) # => "&hellip;"
coder.encode("…", :decimal) # => "&#8230;"
titanous answered Nov 12 '22 09:11



require 'rubygems'
require 'hpricot'

Hpricot('&#8230;', :xhtml_strict => true).to_plain_text

Though you might have to fiddle around with the character encoding.
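As a rough illustration of that kind of encoding fiddling, the sketch below converts a Latin-1 byte such as the "\242" from the question into UTF-8. It assumes Ruby 1.9 or later (where strings carry an encoding) and that the bytes really are ISO-8859-1; the helper name `latin1_to_utf8` is hypothetical, and nothing here is specific to Hpricot.

```ruby
# Hypothetical helper: reinterpret raw bytes as ISO-8859-1, then
# transcode them to UTF-8. Assumes the input really is Latin-1.
def latin1_to_utf8(raw)
  raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "\242" is octal for the single byte 0xA2, the cent sign in ISO-8859-1.
cent = latin1_to_utf8("\242")
cent.bytes # => [194, 162], the two-byte UTF-8 form of the cent sign
```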

Chris Lloyd answered Nov 12 '22 10:11
