I'm using Nokogiri and open-uri to grab the contents of the title tag on a webpage, but am having trouble with accented characters. What's the best way to deal with these? Here's what I'm doing:
    require 'open-uri'
    require 'nokogiri'

    doc = Nokogiri::HTML(open(link))
    title = doc.at_css("title")
At this point, the title looks like this:
Rag\303\271
Instead of:
Ragù
How can I get Nokogiri to return the proper character (e.g. ù in this case)?
Here's an example URL:
http://www.epicurious.com/recipes/food/views/Tagliatelle-with-Duck-Ragu-242037
Summary: When feeding UTF-8 to Nokogiri through open-uri, use open(...).read and pass the resulting string to Nokogiri.
Analysis: If I fetch the page using curl, the headers properly show Content-Type: text/html; charset=UTF-8 and the file content includes valid UTF-8, e.g. "Genealogía de Jesucristo". But even with a magic comment on the Ruby file and setting the doc encoding, it's no good:
    # encoding: UTF-8
    require 'nokogiri'
    require 'open-uri'

    doc = Nokogiri::HTML(open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI'))
    doc.encoding = 'utf-8'
    h52 = doc.css('h5')[1]
    puts h52.text, h52.text.encoding
    #=> Genealogà a de Jesucristo
    #=> UTF-8
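The garbled output above looks like UTF-8 bytes that were decoded as a single-byte encoding and then re-encoded. That is an assumption about the cause, not something confirmed here, but the effect can be reproduced with plain Ruby strings. The sketch below uses ISO-8859-1 as the assumed wrong encoding; the variable names are only for illustration.

```ruby
# encoding: UTF-8

# The correct text, stored as UTF-8 (the "í" occupies two bytes: 0xC3 0xAD).
original = "Genealogía"

# Assumption: somewhere along the way the same bytes are labelled ISO-8859-1,
# so every byte is treated as one character.
mislabelled = original.dup.force_encoding("ISO-8859-1")

# Transcoding to UTF-8 now turns the two bytes of "í" into two separate characters.
garbled = mislabelled.encode("UTF-8")

puts original.length   # 10 characters
puts garbled.length    # 11 characters: one accented letter became two characters

# The damage is reversible as long as no bytes were lost.
repaired = garbled.encode("ISO-8859-1").force_encoding("UTF-8")
puts repaired == original   # true
```

Note that the garbled string still reports UTF-8 as its encoding, which matches the symptom above: the encoding label is right, but the characters are wrong.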
We can see that this is not the fault of open-uri:
    html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
    gene = html.read[/Gene\S+/]
    puts gene, gene.encoding
    #=> Genealogía
    #=> UTF-8
This appears to be a Nokogiri issue when it is handed the open-uri object directly. You can work around it by reading the HTML into a string first and passing that string to Nokogiri:
    # encoding: UTF-8
    require 'nokogiri'
    require 'open-uri'

    html = open('http://www.biblegateway.com/passage/?search=Mateo1-2&version=NVI')
    doc = Nokogiri::HTML(html.read)
    doc.encoding = 'utf-8'
    h52 = doc.css('h5')[1].text
    puts h52, h52.encoding, h52 == "Genealogía de Jesucristo"
    #=> Genealogía de Jesucristo
    #=> UTF-8
    #=> true