
Ruby, Nokogiri: how do I ensure UTF-8 throughout Nokogiri parsing, the ERB template, and the generated HTML file?

I finally managed to parse parts of a website:

get '/' do
  url = '<website>'
  data = Nokogiri::HTML(open(url))
  @rows = data.css("td[valign=top] table tr") 
  erb :muster
end

Now I am trying to extract a particular row in my view, so I put this in my template:

<%= @rows[2] %> 

It does return the markup, but with a UTF-8 problem. I expect:

<td class="class_name">&nbsp;</td>

but instead I get:

<td class="class_name">�</td>

How do I ensure UTF-8 during Nokogiri parsing, ERB rendering, and HTML generation?

littleprinter asked Jan 31 '15 16:01


1 Answer

See: http://www.nokogiri.org/tutorials/parsing_an_html_xml_document.html#encoding

It looks like, in your case, the document declares its encoding as ISO-8859-1:

<meta http-equiv="Content-Type" content="text/html;charset=iso-8859-1">

You can force Nokogiri to treat the stream as UTF-8 by passing the encoding as the third argument (the second argument is the document URL, left as nil here):

data = Nokogiri::HTML(open(url), nil, Encoding::UTF_8.to_s)
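To see why an encoding mismatch shows up as �, here is a minimal sketch using only Ruby's standard library. It assumes the cell reaches the page as ISO-8859-1 bytes, where a non-breaking space is the single byte 0xA0; the `raw` string is a hypothetical stand-in, not output captured from your site.

```ruby
# Hypothetical bytes: the table cell as it might be serialized in ISO-8859-1,
# where a non-breaking space is the single byte 0xA0.
raw = "<td class=\"class_name\">\xA0</td>".b

# Labeling those bytes as UTF-8 leaves an invalid byte sequence...
mislabeled = raw.dup.force_encoding(Encoding::UTF_8)
puts mislabeled.valid_encoding?   # false

# ...which a UTF-8 consumer replaces with U+FFFD (the "�" character).
puts mislabeled.scrub("\uFFFD")

# Declaring the real source encoding and transcoding yields valid UTF-8.
fixed = raw.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts fixed.valid_encoding?        # true
puts fixed.include?("\u00A0")     # true
```

If that is what is happening, making every stage agree on one encoding (the parser argument above, the template, and the response charset) should be enough; transcoding by hand as in `fixed` is only a fallback.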
rainkinz answered Sep 22 '22 09:09