CGI.escapeHTML is pretty bad, but CGI.unescapeHTML is completely borked. For example:
require 'cgi'
CGI.unescapeHTML('&#8230;')
# => "…" # correct - an ellipsis
CGI.unescapeHTML('&hellip;')
# => "&hellip;" # should be "…"
CGI.unescapeHTML('&#162;')
# => "\242" # correct - a cent
CGI.unescapeHTML('&cent;')
# => "&cent;" # should be "\242"
CGI.escapeHTML("…")
# => "…" # should be "&hellip;"
It appears that unescapeHTML knows about all of the numeric codes plus &amp;, &lt;, &gt;, and &quot;. And escapeHTML only knows about those last four -- it doesn't do any of the numeric codes. I understand that escaping doesn't generally need to be as robust, since HTML allows the literal versions of most characters except the four that CGI.escapeHTML knows about. But unescaping should really be better.
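To illustrate why the numeric codes are the easy half: a numeric reference carries its own code point, so decoding it needs no lookup table, whereas a named reference like &hellip; does. Here is a minimal sketch in plain Ruby. decode_numeric_refs is a hypothetical helper name, not part of CGI, and this is not a claim about how CGI implements it internally.

```ruby
# Hypothetical helper, not part of CGI: decodes only numeric character
# references (decimal such as &#8230; or hex such as &#x2026;).
def decode_numeric_refs(text)
  text.gsub(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/) do |match|
    # $1 holds the hex digits, $2 the decimal digits; only one is set.
    code = $1 ? $1.hex : $2.to_i
    # Leave values beyond the Unicode range untouched instead of raising.
    code <= 0x10FFFF ? [code].pack('U') : match
  end
end

decode_numeric_refs('&#8230;')   # the numeric reference is decoded
decode_numeric_refs('&hellip;')  # the named reference comes back unchanged
```

Handling named references on top of this would need a name-to-code-point table, which is the part that seems to be missing.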
Is there a better tool out there, at least for unescaping?
The htmlentities gem should do the trick:
require 'rubygems'
require 'htmlentities'
coder = HTMLEntities.new
coder.decode('&hellip;') # => "…"
coder.decode('&#8230;') # => "…"
coder.decode('&cent;') # => "¢"
coder.decode('&#162;') # => "¢"
coder.encode("…", :named) # => "&hellip;"
coder.encode("…", :decimal) # => "&#8230;"
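If pulling in a gem is not an option and only the decimal form is needed on the escaping side, something along these lines might be enough. It is a rough sketch, not the gem's implementation: encode_decimal_refs is a hypothetical helper name, and it deliberately leaves &, <, > and " alone, so those would still need CGI.escapeHTML or similar.

```ruby
# Hypothetical helper, not part of htmlentities: replaces every
# non-ASCII character with a decimal character reference.
# ASCII characters, including & < > ", are passed through unchanged.
def encode_decimal_refs(text)
  text.gsub(/[^[:ascii:]]/) { |char| format('&#%d;', char.ord) }
end

encode_decimal_refs("wait…")  # the ellipsis becomes a decimal reference
```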
Another option is Hpricot:
require 'rubygems'
require 'hpricot'
Hpricot('&hellip;', :xhtml_strict => true).to_plain_text
You might have to fiddle around with the character encoding, though.