I am currently working on a scraper written in C# 4.0. I use a variety of tools, including the built-in WebClient and RegEx features of .NET. For part of my scraper I am parsing an HTML document using HtmlAgilityPack. I got everything to work as I wanted and then went through some cleanup of the code.
I am using the HtmlEntity.DeEntitize() method to clean up the HTML. I ran a few tests and the method seemed to work great. But when I used the method in my code I kept getting a KeyNotFoundException. There are no further details, so I'm pretty lost. My code looks like this:
WebClient client = new WebClient();
string html = HtmlEntity.DeEntitize(client.DownloadString(path));
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
The downloaded HTML is UTF-8 encoded. How can I get around the KeyNotFoundException?
I understand that the problem is caused by non-standard characters in the input, for example Chinese or Japanese text.
Once you find out which characters are causing the problem, you could search for a suitable patch to HtmlAgilityPack here.
This may be of some help in case you want to modify the HtmlAgilityPack source yourself.
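To find out which input is causing the problem, one option is to feed each entity-like token to HtmlEntity.DeEntitize() on its own and record the ones that throw. The following is only a rough sketch under some assumptions: the class name EntityDiagnostics, the method FindFailingEntities, and the regular expression are made up for illustration, the URL is taken from the first command-line argument, and testing tokens in isolation may not reproduce every failure that occurs on the full document.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
static class EntityDiagnostics
{
    // Assumed pattern: matches entity-like tokens such as &amp; or &#8217;
    // (the trailing semicolon is optional so malformed entities are included).
    static readonly Regex EntityPattern = new Regex(@"&#?\w+;?", RegexOptions.Compiled);

    // Returns the distinct entity-like tokens on which DeEntitize throws
    // a KeyNotFoundException.
    public static IList<string> FindFailingEntities(string html)
    {
        var failing = new List<string>();
        var seen = new HashSet<string>();

        foreach (Match match in EntityPattern.Matches(html))
        {
            // Skip tokens that were already checked.
            if (!seen.Add(match.Value))
                continue;

            try
            {
                HtmlEntity.DeEntitize(match.Value);
            }
            catch (KeyNotFoundException)
            {
                failing.Add(match.Value);
            }
        }

        return failing;
    }

    static void Main(string[] args)
    {
        // args[0] stands in for the URL being scraped.
        var client = new WebClient();
        client.Encoding = Encoding.UTF8; // the question states the page is UTF-8

        string html = client.DownloadString(args[0]);

        foreach (string token in FindFailingEntities(html))
            Console.WriteLine(token);
    }
}
```

If the list comes back empty, the failure probably depends on the surrounding text, and narrowing it down by calling DeEntitize() on smaller and smaller pieces of the document may be more useful.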