
How to convert a Net::HTTP response to a certain encoding in Ruby 1.9.1?

I have a Sinatra application (http://analyzethis.espace-technologies.com) that does the following:

  1. Retrieve an HTML page (via net/http)
  2. Create a Nokogiri document from the response.body
  3. Extract some info and send it back in the response. The response should be UTF-8 encoded

I ran into a problem when reading sites that use the windows-1256 encoding, such as www.filfan.com or www.masrawy.com.

The encoding conversion produces incorrect output, even though no errors are raised.

Calling response.body.encoding on the net/http response returns ASCII-8BIT, which cannot be converted to UTF-8 directly.
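
To illustrate the ASCII-8BIT problem without a network call, here is a minimal sketch that simulates a windows-1256 body tagged as binary, the way response.body arrives. The sample word and the variable names are assumptions made for the example, not taken from the application.

```ruby
# encoding: utf-8

# Simulate what Net::HTTP hands back: windows-1256 bytes tagged as binary.
original = "مرحبا"
raw = original.encode("Windows-1256").force_encoding("ASCII-8BIT")

# Direct transcoding fails: ASCII-8BIT bytes above 0x7F have no mapping to UTF-8.
direct_error = begin
  raw.encode("UTF-8")
  nil
rescue Encoding::UndefinedConversionError => e
  e.class
end

# Relabel the bytes with their real charset first, then transcode.
# dup leaves raw untouched, because force_encoding mutates its receiver.
utf8 = raw.dup.force_encoding("Windows-1256").encode("UTF-8")

puts direct_error.inspect
puts utf8.encoding
puts utf8 == original
```

force_encoding only changes the label on the bytes, while encode actually rewrites them, so the order of the two calls matters.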

If I call Nokogiri::HTML(response.body) and use CSS selectors to extract content from the page (the content of the title tag, for example), I get a string whose string.encoding is WINDOWS-1256. I call string.encode("utf-8") on it and send the result in the response, but the output is still incorrect.

Any suggestions or ideas about what's wrong with my approach?

humanzz asked Jul 30 '09 15:07


2 Answers

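This happens because Net::HTTP does not handle encoding correctly: it leaves the body tagged as ASCII-8BIT regardless of the charset the server declares. See http://bugs.ruby-lang.org/issues/2567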

Instead of parsing the whole response.body, you can read the charset from response['content-type'].

Then use force_encoding() to set the correct encoding. For example:

response.body.force_encoding("UTF-8") if the site is served in UTF-8.
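
The two steps above can be sketched as follows. The helpers charset_from and body_as_utf8 are hypothetical names introduced for illustration, and the header and body values are sample data standing in for response['content-type'] and response.body.

```ruby
# encoding: utf-8

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns nil when the header is missing or carries no charset.
def charset_from(content_type)
  content_type.to_s[/charset\s*=\s*["']?([^\s;"']+)/i, 1]
end

# Hypothetical helper: relabel the binary body with the declared charset,
# then transcode it to UTF-8. The fallback is an assumption, not a standard.
def body_as_utf8(body, content_type, fallback = "UTF-8")
  charset = charset_from(content_type) || fallback
  body.dup.force_encoding(charset).encode("UTF-8")
end

# Sample data standing in for a real response.
header = "text/html; charset=windows-1256"
body   = "مرحبا".encode("Windows-1256").force_encoding("ASCII-8BIT")

puts charset_from(header)
puts body_as_utf8(body, header)
```

If the server declares a charset name that Ruby does not know, force_encoding raises ArgumentError, so a real application would probably want to rescue that and fall back.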

A.D. answered Oct 14 '22 15:10


The following code works for me now:

def document
  if @document.nil? && response
    @document = if document_encoding
                  Nokogiri::HTML(response.body.force_encoding(document_encoding).encode('utf-8'), nil, 'utf-8')
                else
                  Nokogiri::HTML(response.body)
                end
  end
  @document
end

def document_encoding
  return @document_encoding if @document_encoding
  response.type_params.each_pair do |k,v|
    @document_encoding = v.upcase if k =~ /charset/i
  end
  unless @document_encoding
    #document.css("meta[http-equiv=Content-Type]").each do |n|
    #  attr = n.get_attribute("content")
    #  @document_encoding = attr.slice(/charset=[a-z1-9\-_]+/i).split("=")[1].upcase if attr
    #end
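    # Fall back to the charset declared in the page's Content-Type meta tag.
    # The character classes stop at the closing quote, so trailing markup on the same line is not captured.
    @document_encoding = response.body =~ /<meta[^>]*HTTP-EQUIV=["']Content-Type["'][^>]*content=["']([^"']*)["']/i && $1 =~ /charset=([\w.:-]+)/i && $1.upcase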
  end
  @document_encoding
end 
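
The meta-tag fallback in document_encoding can be tried in isolation. The sketch below is a standalone variant rather than the code above: meta_charset is a hypothetical helper name, and the HTML snippet is made-up sample input.

```ruby
# Hypothetical helper: find the charset declared in an http-equiv meta tag.
def meta_charset(html)
  # Scan a binary-labelled copy so invalid byte sequences cannot raise during matching.
  binary = html.dup.force_encoding("ASCII-8BIT")
  tag = binary[/<meta[^>]*http-equiv=["']?content-type["']?[^>]*>/i]
  return nil unless tag

  # Capture only characters that can appear in a charset name.
  name = tag[/charset\s*=\s*["']?([\w.:-]+)/i, 1]
  name && name.upcase
end

# Sample input: the meta tag shares a line with other quoted attributes.
html = %(<html><head><meta http-equiv="Content-Type" content="text/html; charset=windows-1256" /><link rel="stylesheet" href="a.css"></head></html>)

puts meta_charset(html)
```

Restricting the capture to charset-name characters keeps the result usable with force_encoding even when other tags follow on the same line.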
humanzz answered Oct 14 '22 16:10