
Big5 to utf-8 encoding while scraping website with Node-request

I am new to Node.js and am trying to use the request module to scrape a website, but I am having a problem with the encoding. The target website uses Big5 as its encoding, and I want to convert it to UTF-8 with the following code:

var Iconv = require('iconv').Iconv;
var fs = require('fs');
var big5_to_utf8 = new Iconv('big5', 'utf-8');
var buffer = big5_to_utf8.convert(fs.readFileSync('./test'));
console.log(buffer.toString());

I suspect the problem may be caused by something wrong in the scraping step, so for reference, here is my scraping code:

var fs = require('fs');
var request = require('request');

var j = request.jar()
var cookie = request.cookie('ASPSESSIONIDCSDCTTSR=KDMMMIMDCCIHJIJFDKGEDFOH')
j.add(cookie)

request({
    url: 'http://amis.afa.gov.tw/v-asp/v101r.asp',
    method: "POST",
    "Content-type": "application/x-www-form-urlencoded;",
    jar:true,
    encoding: 'utf-8',
    form: {
        mhidden1:false,
        myy:101,
        mmm:9,
        mdd:25,
        mpno:"FC",
        mpnoname:"%ADJ%A5%CA++++",
        B1:"%B6%7D%A9l%ACd%B8%DF",
    }
}, function (error, response, body) {
    console.log(body);
    fs.writeFile("test", body);
});

Really appreciate your help.

EDIT:

To be more specific about the error, this is what the code returns:

<p align="center"><font color="#800080">�Шϥ��s�����u���C��</font><em><font
size="4" color="#000080">[�W�@��]</font></em><font color="#800080">�^���e�@���J�����e���~���d��</font></p>

This is what it should return:

<p align="center"><font color="#800080">請使用瀏覽器工具列中</font><em><font size="4" color="#000080">[上一頁]</font></em><font color="#800080">回到前一輸入條件畫面繼續查詢</font></p>

I also tried using iconv-lite instead of iconv, replacing the callback with the following:

function (error, response, body) {
    var bufferhelper = new BufferHelper();
    bufferhelper.concat(body);
    console.log(iconv.decode(bufferhelper.toBuffer(), 'Big5'));
});

Only to get:

<p align="center"><font color="#800080">�濆詉胬胬譃胬舚胬</font><em><font
size="4" color="#000080">[抝胬]</font></em><font color="#800080">䒷胬蓚胬鸜胬胬蓚胬趦胬胬</font</p>
muyueh asked Oct 24 '13 05:10


2 Answers

I use iconv-lite to decode Big5 to UTF-8.

You should also set encoding: null so that request returns the raw response body as a Buffer instead of a decoded string.

Here is sample code:

var iconv = require('iconv-lite');
var request = require('request');
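// encoding: null makes request pass the body to the callback as a raw Buffer
request({ url: 'http://amis.afa.gov.tw/v-asp/v101r.asp', encoding: null }, function (err, response, body) {
  if (!err && response.statusCode == 200) {
    // body is already a Buffer here, so decode it directly from Big5
    var str = iconv.decode(body, "big5");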
    console.log(str);
  }
});

This returns:

<html>

<head>
<meta http-equiv="Content-Type" content="text/html; charset=big5">
<title>v101r</title>
<meta name="GENERATOR" content="Microsoft FrontPage 4.0">
<meta name="Microsoft Theme" content="none, default">
</head>

<body>
<p align="center">查無結果!</p>

<p align="center"><font color="#800080">請使用瀏覽器工具列中</font><em><font
size="4" color="#000080">[上一頁]</font></em><font color="#800080">回到前一輸入條件畫面繼續查詢</font></p>
</body>
</html>

I use Node.js 0.10.20 on RedHat EL 6.4, with iconv-lite 0.2.11 and request 2.27.0.
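
To illustrate why encoding: null matters, here is a minimal sketch (not part of the original answer) showing that decoding non-UTF-8 bytes as UTF-8 is lossy, which is why converting the already-decoded body afterwards cannot work. The four bytes are taken from the start of the B1 form value in the question and are only assumed to be representative Big5 data. It uses Buffer.from, which needs a newer Node.js than the 0.10.20 mentioned above.

```javascript
// Sample bytes from the question's B1 form value ("%B6%7D%A9l"):
// 0xB6 0x7D 0xA9 0x6C. Assumed here to stand in for Big5 page content.
var original = Buffer.from([0xB6, 0x7D, 0xA9, 0x6C]);

// This mimics what happens when the body is decoded as UTF-8:
// 0xB6 and 0xA9 are not valid UTF-8 lead bytes, so each one is
// replaced by U+FFFD (the replacement character).
var lossy = original.toString('utf8');

// Re-encoding the string does not bring the original bytes back.
var roundTripped = Buffer.from(lossy, 'utf8');

console.log(original.length);               // 4
console.log(roundTripped.length);           // 8 (ef bf bd 7d ef bf bd 6c)
console.log(original.equals(roundTripped)); // false
```

Once the bytes have been replaced by U+FFFD, no converter can recover the original characters, so the Big5 decoding has to run on the raw Buffer.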

Ian Wu answered Sep 23 '22 13:09


Might I suggest my codepage library:

var request = require('request'), codepage = require('codepage');
request({ url: 'http://amis.afa.gov.tw/v-asp/v101r.asp', encoding: null }, function (err, response, body) {
  if (!err && response.statusCode == 200) {
    var str = codepage.utils.decode(950, new Buffer(body));
    console.log(str);
  }
});

This yields:

... <p align="center"><font color="#800080">請使用瀏覽器工具列中</font><em><font
size="4" color="#000080">[上一頁]</font></em><font color="#800080">回到前一輸入條件畫面繼續查詢</font></p>
SheetJS answered Sep 20 '22 13:09