
Handling bad UTF-8 from JSON in Ruby

I'm pulling data from a remote JSON API at http://hndroidapi.appspot.com/news/format/json/page/?appid=test . The problem is that this API appears to build its JSON without handling UTF-8 encoding correctly (correct me if I'm wrong here). For example, part of the result it currently returns is:

{
"title":"IPad - please don€™t ding while you and I are asleep  ",
"url":"http://modern-products.tumblr.com/post/25384729998/ipad-please-dont-ding-while-you-and-i-are-asleep",
"score":"10 points",
"user":"roee",
"comments":"18 comments",
"time":"1 hour ago",
"item_id":"4128497",
"description":"10 points by roee 1 hour ago  | 18 comments"
}

Notice the don€™t. And that isn't the only type of character it is choking on. Is there anything I can do to convert the data into something clean, given that I don't control the API?

Edit:

Here is how I'm pulling down the JSON:

require 'net/http'
require 'json'

hn_url = "http://hndroidapi.appspot.com/news/format/json/page/?appid=test"
url = URI.parse(hn_url)

# Attempt to get the json
req = Net::HTTP::Get.new(hn_url)
req.add_field('User-Agent', 'Test')
res = Net::HTTP.start(url.host, url.port) {|http| http.request(req) }
response = res.body
if response.nil?
  puts "Bad response when fetching HN json"
  return
end

# Attempt to parse the json
result = JSON.parse(response)
if result.nil?
  puts "Error parsing HN json"
  return
end

Edit 2:

Just found the API's GitHub page. It looks like this is an outstanding issue. I'm still not sure whether there are any workarounds I can apply on my end: https://github.com/glebpopov/Hacker-News-Droid-API/issues/4

hodgesmr asked Jun 18 '12 22:06



1 Answer

It looks like the JSON response body you are receiving is tagged as US-ASCII instead of UTF-8, because Net::HTTP purposely doesn't force an encoding on the body.

1.9.3p194 :044 > puts res.body.encoding
US-ASCII

In Ruby 1.9.3, you can force the encoding if you know what it's supposed to be. Try this:

response = res.body.force_encoding('UTF-8')

The JSON parser should then handle the UTF-8 the way you want it to.
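
To illustrate the suggestion above without hitting the network, here is a minimal sketch. The helper name parse_utf8_json and the sample bodies are made up for this example and are not part of the API or of Net::HTTP. Note that force_encoding only changes the encoding label on the string; it does not convert any bytes, so it only helps when the bytes really are UTF-8. The fallback for invalid byte sequences uses String#scrub, which assumes Ruby 2.1 or later (it is not available in 1.9.3).

```ruby
require 'json'

# Hypothetical helper (not from the original answer): relabel a raw
# response body as UTF-8 and parse it as JSON.
def parse_utf8_json(raw_body)
  # dup so the caller's string is left untouched. force_encoding only
  # changes the encoding label; it does not transcode any bytes.
  text = raw_body.dup.force_encoding('UTF-8')

  # If the bytes are not valid UTF-8, replace the offending sequences
  # with U+FFFD. String#scrub assumes Ruby 2.1 or later.
  text = text.scrub unless text.valid_encoding?

  JSON.parse(text)
end

# Simulate a body whose bytes are UTF-8 but which is labelled as binary,
# similar to what res.body may look like before relabelling.
raw = "{\"title\":\"don\u2019t\"}".dup.force_encoding('ASCII-8BIT')
parsed = parse_utf8_json(raw)

# Simulate a body that contains a byte that is not valid UTF-8.
bad = "{\"title\":\"a\xFFb\"}".dup.force_encoding('ASCII-8BIT')
scrubbed = parse_utf8_json(bad)

puts parsed['title'].encoding
puts parsed['title']
```

If the upstream API has already mangled the characters before sending them (as the linked GitHub issue suggests it might), relabelling on the client cannot recover the original text; this sketch only covers the case where the bytes on the wire are correct UTF-8.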

References

  • http://bugs.ruby-lang.org/ - Net::HTTP does not handle encoding correctly
fdsaas answered Oct 21 '22 10:10
