KeyNotFoundException when using the HtmlEntity.DeEntitize() method

Question

I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and Regex features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything to work as I wanted and then went through some cleanup of the code.

I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work well. But when I used the method in my actual code I kept getting a KeyNotFoundException. The exception carries no further details, so I'm pretty lost. My code looks like this:

WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

The downloaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?

Shoaib Mohamed · Accepted Answer

As I understand it, the problem is caused by non-standard characters in the input, for example Chinese or Japanese text.
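To narrow down what in the input triggers the exception, it can help to list every named entity in the downloaded HTML and look for names that are not standard HTML entities. The following is a minimal sketch, not part of HtmlAgilityPack: EntityScan and FindEntityNames are hypothetical names, and it assumes the exception comes from an entity name that the library's lookup table does not contain.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class EntityScan
{
    // Matches named entities such as &amp; or &nbsp;
    // (numeric entities like &#39; are deliberately skipped).
    private static readonly Regex NamedEntity =
        new Regex(@"&([A-Za-z][A-Za-z0-9]*);");

    // Returns the distinct entity names found in the input, in ordinal order.
    public static List<string> FindEntityNames(string html)
    {
        return NamedEntity.Matches(html)
            .Cast<Match>()
            .Select(m => m.Groups[1].Value)
            .Distinct()
            .OrderBy(name => name, StringComparer.Ordinal)
            .ToList();
    }
}
```

Calling EntityScan.FindEntityNames(client.DownloadString(path)) and scanning the result for unfamiliar names should show which entities are worth testing individually against DeEntitize().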

Once you have found out which characters are causing the problem, you could search the HtmlAgilityPack project for a suitable patch.

This may also be of some help if you want to modify the HtmlAgilityPack source yourself.
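If patching the library is not an option, a defensive wrapper may be enough. The sketch below tries HtmlEntity.DeEntitize() first and falls back to WebUtility.HtmlDecode (in System.Net, available since .NET 4.0) when the exception is thrown. SafeHtml and SafeDeEntitize are hypothetical names, and the fallback is a swapped-in framework decoder, not HtmlAgilityPack behavior; it is assumed here to leave unrecognized entities unchanged instead of throwing.

```csharp
using System.Collections.Generic;
using System.Net;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class SafeHtml
{
    // Tries the HtmlAgilityPack decoder first. If it throws
    // KeyNotFoundException, falls back to the framework decoder
    // (assumption: unknown entities are left as-is by the fallback).
    public static string SafeDeEntitize(string text)
    {
        try
        {
            return HtmlEntity.DeEntitize(text);
        }
        catch (KeyNotFoundException)
        {
            return WebUtility.HtmlDecode(text);
        }
    }
}
```

It is probably safer to call doc.LoadHtml() on the raw download and then apply SafeHtml.SafeDeEntitize() to the InnerText of the nodes you extract, rather than decoding the whole page before parsing, because decoding first turns entities such as &lt; into literal markup characters that the parser will then interpret.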

Tags: c#, html-agility-pack, keynotfoundexception

Sebastian Brandes Kraaijenzank
