
Node.js Cheerio parser breaks UTF-8 encoding

I parse my request with Cheerio like this:

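var request = require("request");
var cheerio = require("cheerio");

// The URL must be a quoted string literal.
var url = "http://shop.nag.ru/catalog/16939.IP-videonablyudenie-OMNY/16944.IP-kamery-OMNY-c-vario-obektivom/16704.OMNY-1000-PRO";
request.get(url, function (err, response, body) {
  console.log(body); // raw response body
  var $ = cheerio.load(body); // declare $ locally instead of leaking a global
  console.log($(".description").html());
});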

The output contains the content, but Cheerio returns it in a strange, unreadable encoding:

// Plain body from console.log(body) (the text is Russian):
<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1><p style

// Cheerio's output from console.log($(".description").html()):
<h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY

The target page is encoded in UTF-8, so why does Cheerio break my encoding?

I tried using iconv to decode the response body:

var body1 = iconv.decode(body, "utf-8");

but console.log($(".description").html()); still returns the same unreadable text.

MeetJoeBlack asked Jul 22 '15 21:07


2 Answers

Cheerio hasn't broken anything. It's outputting HTML entities, which any browser will render exactly the same as the original HTML input. Compare the two versions below, which display identically in a browser:

<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>

<h1><span style="font-size: 16px;">&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY - &#x43F;&#x43E;&#x43F;&#x440;&#x43E;&#x431;&#x443;&#x439;&#x442;&#x435; &#x43D;&#x430;&#x439;&#x442;&#x438; &#x43B;&#x443;&#x447;&#x448;&#x435;</span></h1>

&#x423;, for example, is the character У encoded as an HTML entity, in the same way the entity &gt; represents >.
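
To check that the two forms carry the same text, the numeric references can be decoded back by hand. Here is a minimal sketch in plain JavaScript. Note that decodeNumericEntities is a hypothetical helper written for this illustration, not part of Cheerio, and it only handles numeric references (&#x...; and &#...;), not named ones such as &gt;.

```javascript
// Hypothetical helper (not part of Cheerio): turn numeric character
// references such as &#x423; or &#1059; back into the characters they name.
function decodeNumericEntities(html) {
  return html.replace(/&#(x[0-9a-f]+|[0-9]+);/gi, function (match, ref) {
    var isHex = ref[0] === "x" || ref[0] === "X";
    var codePoint = isHex ? parseInt(ref.slice(1), 16) : parseInt(ref, 10);
    // Leave the reference untouched if it is outside the Unicode range.
    if (codePoint > 0x10ffff) {
      return match;
    }
    return String.fromCodePoint(codePoint);
  });
}

var encoded = "&#x423;&#x43B;&#x438;&#x447;&#x43D;&#x430;&#x44F; 3&#x41C;&#x43F; IP HD &#x43A;&#x430;&#x43C;&#x435;&#x440;&#x430; OMNY";
console.log(decodeNumericEntities(encoded));
// => Уличная 3Мп IP HD камера OMNY
```

The decoded string matches the start of the plain body from the question, so nothing was lost; only the notation differs.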

However, if you want to get the unencoded text, you can set the decodeEntities option to false:

const $ = cheerio.load(
  `<h1><span style="font-size: 16px;">Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше</span></h1>`,
  { decodeEntities: false }
);


console.log($('span').html())
// => Уличная 3Мп IP HD камера OMNY - попробуйте найти лучше

Jordan Running answered Nov 10 '22 01:11


I ran into this issue earlier today when I tried to load a page with Cheerio that contained special characters such as ç, á, and é.

By default, Cheerio decodes entities and then presents such characters as the numeric HTML encoding of the Unicode character.

For example, instead of ç it would give us &#xE7;.
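
To make the mapping concrete, here is a rough sketch of that kind of encoding in plain JavaScript. The encodeNonAscii function is a hypothetical helper for illustration only; it is an assumption about the general shape of the output, not Cheerio's actual serializer.

```javascript
// Hypothetical helper: replace every non-ASCII character with a hexadecimal
// numeric character reference. Illustration only, not Cheerio's serializer.
function encodeNonAscii(text) {
  var out = "";
  // for...of walks the string by code point, so characters outside the
  // Basic Multilingual Plane are not split into surrogate halves.
  for (var ch of text) {
    var codePoint = ch.codePointAt(0);
    out += codePoint > 0x7f
      ? "&#x" + codePoint.toString(16).toUpperCase() + ";"
      : ch;
  }
  return out;
}

console.log(encodeNonAscii("ç"));
// => &#xE7;
```

ASCII text passes through unchanged, which is why only the accented or Cyrillic characters look different in the output.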

To fix that, I just had to turn this behavior off by passing decodeEntities: false as an option to cheerio.load:

const $ = cheerio.load(body, { decodeEntities: false });

costargc answered Nov 10 '22 00:11