Getting correct string from windows-1250 encoded web page with node.js

I am trying to scrape some data from a web page with Node.js, but I am having problems with character encoding. The page declares its encoding as <meta http-equiv="Content-Type" content="text/html; charset=windows-1250">, and when I open it in Chrome, the browser sets the encoding to windows-1250 and everything looks fine.

As there is no windows-1250 encoding/decoding for streams in Node (and utf8 did not work), I found the iconv-lite package, which should be able to convert between encodings easily. But I still get wrong characters after I save the response to a file (or print it to the console). I also tried other encodings, the native Node buffer encodings, and setting the request headers to match what I see in Chrome ('Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), but nothing seems to work correctly.

You can see the whole code here: https://gist.github.com/4110999

I suppose I am missing something fundamental regarding how the encoding works so any help on how to get the data with correct characters would be appreciated.

EDIT:
I also tried the node-iconv package in case the problem is in iconv-lite itself. I changed line 51 of the gist to:

var decoder = new Iconv_native('WINDOWS-1250', 'UTF-8');  
var decoded = decoder.convert(body).toString();

but I still get the same results.

aocenas asked Nov 19 '12 14:11


1 Answer

I'm not familiar with the iconv-lite package, but looking through its code, it looks like you'll need to use win1250 instead of windows1250.

The encodings are looked up as a hash.

Also, the readme uses this code instead of 'windows1251':

str = iconv.decode(buf, 'win1251');
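
For a quick cross-check that does not depend on any package, here is a minimal sketch using Node's built-in TextDecoder rather than iconv-lite. It assumes a Node.js build with full ICU data (without it, 'windows-1250' may be rejected with a RangeError), and the function name and sample bytes are made up for illustration. The point it shows is that the decoder has to be given the raw bytes; whether that is the cause of the problem in the gist is only a guess.

```javascript
// Sketch only: decode windows-1250 bytes with the built-in TextDecoder.
// Assumes Node.js was built with full ICU data; otherwise the constructor
// may throw a RangeError for 'windows-1250'.
function decodeWin1250(buf) {
  // The input must be raw bytes (Buffer / Uint8Array), not an already decoded string.
  return new TextDecoder('windows-1250').decode(buf);
}

// Hypothetical sample: the windows-1250 bytes for the letters "Š", "č", "ž".
const sample = Buffer.from([0x8a, 0xe8, 0x9e]);

const text = decodeWin1250(sample);

// For comparison: reading the same bytes as UTF-8 produces replacement
// characters, because these bytes are not valid UTF-8 sequences.
const wrong = sample.toString('utf8');

console.log(text);
console.log(wrong);
```

If the two printed lines differ in the same way your file output differs from what Chrome shows, the bytes are probably being turned into a UTF-8 string somewhere before they reach the decoder.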
Jim Schubert answered Nov 02 '22 01:11