
KeyNotFoundException when using the HtmlEntity.DeEntitize() method

I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and Regex features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything working as I wanted and then did some code cleanup.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work great. But when I used the method in my actual code I kept getting a KeyNotFoundException. There are no further details, so I'm pretty lost. My code looks like this:

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The downloaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?

Sebastian Brandes Kraaijenzank asked Nov 07 '12 18:11


1 Answer

As far as I understand, the problem is caused by non-standard characters in the input, for example Chinese or Japanese text.

Once you find out which characters are causing the problem, you could search for a suitable patch to HtmlAgilityPack here
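One way to narrow down the offending input is to probe each entity reference in the downloaded HTML on its own. The following is a minimal sketch, assuming the exception is triggered by an entity reference that DeEntitize cannot resolve. `EntityProbe` and `FindFailingEntities` are hypothetical names, and the regular expression only covers references that end with a semicolon.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

static class EntityProbe
{
    // Hypothetical helper: returns the distinct entity-like tokens in html
    // for which HtmlEntity.DeEntitize throws KeyNotFoundException.
    public static List<string> FindFailingEntities(string html)
    {
        var failing = new List<string>();
        var seen = new HashSet<string>();

        // Matches named (&name;) and numeric (&#123; or &#x1F;) references.
        foreach (Match m in Regex.Matches(html, @"&#?\w+;"))
        {
            // Probe each distinct token only once.
            if (!seen.Add(m.Value))
                continue;

            try
            {
                HtmlEntity.DeEntitize(m.Value);
            }
            catch (KeyNotFoundException)
            {
                failing.Add(m.Value);
            }
        }

        return failing;
    }
}
```

Calling `EntityProbe.FindFailingEntities(client.DownloadString(path))` on the page from the question would then list the tokens worth checking against a patch or the library source.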

This may be of some help if you want to modify the HtmlAgilityPack source yourself.
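If patching the library is not an option, a possible workaround is to catch the exception and fall back to the framework's own decoder. This is only a sketch of a swapped-in technique, not a fix for HtmlAgilityPack itself: `SafeDecode` and `DeEntitizeOrFallback` are hypothetical names, and `System.Net.WebUtility.HtmlDecode` (available from .NET 4.0) may decode a slightly different set of entities than DeEntitize does.

```csharp
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

static class SafeDecode
{
    // Hypothetical helper: try HtmlAgilityPack first and fall back to the
    // framework decoder if DeEntitize throws KeyNotFoundException.
    public static string DeEntitizeOrFallback(string text)
    {
        try
        {
            return HtmlEntity.DeEntitize(text);
        }
        catch (KeyNotFoundException)
        {
            // Assumption: the framework decoder is an acceptable substitute
            // for the input that DeEntitize could not handle.
            return WebUtility.HtmlDecode(text);
        }
    }
}
```

It may also be worth loading the raw HTML into the HtmlDocument first and decoding only the text of the nodes you need, for example `SafeDecode.DeEntitizeOrFallback(node.InnerText)`, since decoding the whole page before parsing can turn escaped markup such as `&lt;` into real tags.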

Shoaib Mohamed answered Oct 27 '22 00:10