
http.get and ISO-8859-1 encoded responses

I'm about to write an RSS feed fetcher and I'm stuck with some charset problems.

Loading and parsing the feed was quite easy compared to the encoding. I'm loading the feed with http.get and concatenating the chunks on every data event. Later I parse the whole string with the npm library feedparser, which works fine with the given string.

Sadly, I'm used to functions like utf8_encode() in PHP and I miss them in Node.js, so I'm stuck using Iconv, which is currently not doing what I want.

Without any conversion I get several UTF-8 replacement characters (the ?-icons) where the charset is wrong; with iconv, the string is parsed wrongly :/

Currently I'm converting every string separately:

// var encoding = 'ISO-8859-1' etc. (it's the right one, checked against the docs)
// Shortened version

var iconv = new Iconv(encoding, 'UTF-8');

parser.on('article', function(article){
    var object = {
        title : iconv.convert(article.title).toString('UTF-8'),
        description : iconv.convert(article.summary).toString('UTF-8')
    };
    Articles.push(object);
});

Should I start converting on the data buffers, or later on the complete string?

Thank you!

PS: The encoding is determined by parsing the head of the XML.
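Since the question mentions reading the encoding from the XML head, that step can be sketched like this (the helper name and regex are my own; a missing declaration defaults to UTF-8, as the XML spec prescribes):

```javascript
// Hypothetical helper: sniff the charset from the XML declaration,
// e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
function sniffXmlEncoding(head) {
  var match = /<\?xml[^>]*encoding=["']([^"']+)["']/i.exec(String(head));
  return match ? match[1].toUpperCase() : 'UTF-8'; // UTF-8 is the XML default
}
```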

Is there a module that makes encoding in Node.js easier?

asked Jan 18 '12 by moe

2 Answers

You are probably hitting the same problem described on https://groups.google.com/group/nodejs/browse_thread/thread/b2603afa31aada9c.

The solution seems to be to set the response encoding to binary before processing the Buffer with Iconv.

The relevant bit is

set response.setEncoding('binary') and aggregate the chunks into a buffer before calling Iconv.convert(). Note that with encoding='binary' your data callback receives binary-encoded strings rather than Buffers, so rebuild a Buffer from the aggregated string (e.g. new Buffer(body, 'binary')) before converting; alternatively, skip setEncoding entirely and collect the raw Buffer chunks.
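As a minimal sketch of that aggregation step (the helper name is mine, and it uses Node's built-in 'latin1' Buffer decoding instead of iconv, which covers the ISO-8859-1 case specifically):

```javascript
// Hypothetical helper: keep the raw Buffer chunks (no setEncoding at all),
// then decode the whole body once at the end. Node's Buffer can decode
// ISO-8859-1 natively via the 'latin1' encoding, so iconv is optional here.
function decodeLatin1Chunks(chunks) {
  return Buffer.concat(chunks).toString('latin1');
}

// Usage sketch: res.on('data', function(c){ chunks.push(c); }) and then
// call decodeLatin1Chunks(chunks) inside the 'end' handler.
```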


Updated: this was my initial response

Are you sure that the feed you are receiving has been encoded correctly?

I can see two possible errors:

  1. the feed is being sent with Latin-1-encoded data, but with a Content-Type that states charset=UTF-8.
  2. the feed is being sent with UTF-8-encoded data but the Content-Type header does not state anything, defaulting to ASCII.

You should check the content of your feed and the sent headers with some utility like Wireshark or cURL.
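For the header check, a small helper like this (my own sketch) makes it easy to compare the charset the server declares against the bytes it actually sends:

```javascript
// Hypothetical helper: extract the charset parameter from a
// Content-Type header value, or null if none is declared.
function charsetFromContentType(value) {
  var match = /charset=["']?([\w-]+)/i.exec(value || '');
  return match ? match[1].toUpperCase() : null;
}
```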

answered Nov 12 '22 by gioele

I think the issue is probably with the way that you are storing the data before you are passing it to feedparser. It is hard to say without seeing your data event handler, but I'm going to guess that you are doing something like this:

values = '';
stream.on('data', function(chunk){
  values += chunk;
});

Is that right?

The issue is that in this case, chunk is a buffer, and by using '+' to append them all together, you implicitly convert the buffer to a string.
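A quick way to see the damage (assuming the feed contains a Latin-1 'é', byte 0xE9): concatenating a Buffer with '+' decodes it as UTF-8 on the spot, and 0xE9 is not a valid standalone UTF-8 byte:

```javascript
// 0xE9 is 'é' in ISO-8859-1, but an invalid lone byte in UTF-8.
var chunk = Buffer.from([0xe9]);
var mangled = '' + chunk;             // implicit chunk.toString('utf8')
// mangled is now '\uFFFD' (the replacement character); the original byte is gone.
var intact = Buffer.concat([chunk]);  // keeping Buffers preserves the raw byte
```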

Looking into it further, you should really be doing the iconv conversion on the whole feed, before running it through feedparser, because feedparser is likely not aware of other encodings.

Try something like this:

var iconv = new Iconv('ISO-8859-1', 'UTF-8');
var chunks = [];
var totallength = 0;
stream.on('data', function(chunk) {
  chunks.push(chunk);          // keep the raw Buffers, no string coercion
  totallength += chunk.length;
});
stream.on('end', function() {
  // Reassemble all chunks into a single Buffer...
  var results = new Buffer(totallength);
  var pos = 0;
  for (var i = 0; i < chunks.length; i++) {
    chunks[i].copy(results, pos);
    pos += chunks[i].length;
  }
  // ...then convert the whole feed at once and hand it to feedparser.
  var converted = iconv.convert(results);
  parser.parseString(converted.toString('utf8'));
});
answered Nov 12 '22 by loganfsmyth