I am trying to scrape some data from a web page with Node.js, but I am having problems with character encoding.
The web page states that its encoding is:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
When I browse it with Chrome, it sets the encoding to windows-1250 and everything looks fine.
As there is no windows-1250 encoding/decoding for streams in Node (and utf8 did not work), I found the iconv-lite package, which should be able to convert between different encodings easily. But I still get wrong characters after I save the response to a file (or output it to the console). I also tried different encodings, the native Node buffer encodings, and setting the headers to match what I see in Chrome ('Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), but nothing seems to work correctly.
You can see the whole code here: https://gist.github.com/4110999
I suppose I am missing something fundamental about how encoding works, so any help on how to get the data with the correct characters would be appreciated.
EDIT:
I also tried the node-iconv package in case the problem is in the package itself. I changed line 51 to:
var decoder = new Iconv_native('WINDOWS-1250', 'UTF-8');
var decoded = decoder.convert(body).toString();
but I am still getting the same results.
I'm not familiar with the iconv-lite package, but looking through its code, it looks like you'll need to use win1250 instead of windows1250.
The encoding names are looked up as keys in a hash, so the name you pass has to match one of those keys.
The readme likewise uses 'win1251' rather than 'windows1251':
str = iconv.decode(buf, 'win1251');
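If the encoding name is right and the output is still wrong, one common cause (which may or may not apply to the gist) is that the response bytes were already turned into a UTF-8 string before the conversion ran. Here is a minimal sketch of that effect. It uses Node's built-in TextDecoder instead of iconv-lite so it runs without extra packages; that substitution is mine, and it assumes a Node build with full ICU, since otherwise 'windows-1250' may not be available. The byte values are illustrative sample data, not taken from the page in question.

```javascript
// Two bytes as a windows-1250 server would send them: 0x9E = "ž", 0x9A = "š".
const raw = Buffer.from([0x9e, 0x9a]);

// Decoding the bytes as UTF-8 first is lossy: both bytes are invalid on
// their own in UTF-8, so each is replaced by U+FFFD and the original
// values cannot be recovered by any later conversion.
const wrong = raw.toString('utf8');

// Decoding the untouched bytes with the correct charset works.
// TextDecoder is a Node global; 'windows-1250' needs full ICU (assumption).
const right = new TextDecoder('windows-1250').decode(raw);

console.log(JSON.stringify(wrong));
console.log(right);
```

With iconv-lite, the corresponding call would follow the readme form above, with the raw Buffer as the first argument and 'win1250' as the encoding name. Either way, the point is to keep the body as a Buffer until it is decoded.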