Incompatible encodings with ruby and Nokogiri HTML

I'm parsing an external HTML page with Nokogiri. That page is encoded as ISO-8859-1. Part of the data I want to extract contains some – (dash) HTML entities:

xml = Nokogiri.HTML(open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
f = xml.xpath("//div[@style='background-color:#D9DBD9; padding:15px 12px 10px 10px;']//div[@class='tit_inter_cnz']/text()")
f[0].text #=> Preview M/E/C/A \u0096 John Digweed

In the last line, the string should be rendered in the browser with a dash. The browser renders it correctly if I declare my page as ISO-8859-1, but my Sinatra app uses UTF-8. How can I display that text correctly in the browser? Currently it is displayed as a square with a small number inside. I tried force_encoding('ISO-8859-1'), but then I get a CompatibilityError from Sinatra.

Any clues?

[Edit] Below are screenshots of the app:

-> Firefox with character encoding UTF-8 (screenshot)

-> Firefox with character encoding Western (ISO-8859-1) (screenshot)

It's worth mentioning that in the ISO-8859-1 mode above the dash is shown correctly, but another, incorrect character appears just before it. Weird :(

Felipe Lima asked Dec 29 '22 02:12



1 Answer

After parsing a document in Nokogiri you can tell it to assume a different encoding. Try:

require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://flybynight.com.br/agenda.php"), nil, 'ISO-8859-1')
doc.encoding = 'UTF-8'

I can't see that page from here, to confirm this fixes the problem, but it's worked for similar problems.
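
If the text still comes out containing \u0096 after that, one possible explanation (not verified against this page) is that U+0096 is a C1 control character, which browsers conventionally display as the character Windows-1252 assigns to byte 0x96, an en dash. A minimal sketch of remapping such characters before sending the text out as UTF-8 follows. The helper name fix_c1_controls is hypothetical, not part of Nokogiri or Sinatra, and it assumes the input string is already UTF-8, as Nokogiri's text output normally is:

```ruby
# Hypothetical helper: reinterpret C1 control characters (U+0080..U+009F)
# as the Windows-1252 characters that browsers usually display for them.
def fix_c1_controls(text)
  text.gsub(/[\u0080-\u009F]/) do |char|
    # Turn the code point into a single raw byte, e.g. U+0096 -> "\x96"
    byte = [char.ord].pack('C')
    begin
      # Read that byte as Windows-1252 and transcode it to UTF-8
      byte.force_encoding('Windows-1252').encode('UTF-8')
    rescue EncodingError
      # Byte has no usable Windows-1252 mapping: keep the original character
      char
    end
  end
end

title = "Preview M/E/C/A \u0096 John Digweed"
puts fix_c1_controls(title)
```

With the sample string from the question, the \u0096 should become U+2013 (en dash), which a UTF-8 page can display directly. Characters outside the U+0080..U+009F range are left untouched.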

the Tin Man answered Dec 30 '22 15:12
