I have some very simple code:
XmlDocument doc = new XmlDocument();
Console.WriteLine("loading");
doc.Load(url);
Console.WriteLine("loaded");
XmlNodeList nodeList = doc.GetElementsByTagName("p");
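foreach (XmlNode node in nodeList)
{
    // An empty <p/> has no children, so check before reading the first one.
    if (node.FirstChild != null)
        Console.WriteLine(node.FirstChild.Value);
}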
return source;
I'm working on this file and it takes two minutes to load. Why does it take so long? I tried both fetching the file from the net and loading a local copy.
I imagine it's the DTD referenced by the page that's taking so long to load. Since that DTD defines the entities the page uses, you shouldn't disable DTD processing, so you're probably better off not going down the XmlDocument path at all.
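If you want to check that guess before changing anything else, a rough timing test is one option. The sketch below assumes the document location is passed as the first command-line argument and that clearing the resolver is enough to stop the external DTD from being fetched; the class name is made up for illustration. It is a diagnostic only, not a fix, because entity references may then fail to resolve.

```csharp
using System;
using System.Diagnostics;
using System.Xml;

class DtdTimingCheck
{
    static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: DtdTimingCheck <url-or-path>");
            return;
        }

        var doc = new XmlDocument();
        // With no resolver set, external resources such as the DTD are not fetched.
        doc.XmlResolver = null;

        var timer = Stopwatch.StartNew();
        try
        {
            doc.Load(args[0]);
            Console.WriteLine("Loaded without the DTD in {0} ms", timer.ElapsedMilliseconds);
        }
        catch (XmlException ex)
        {
            // Likely if the page uses entities that only the DTD declares.
            Console.WriteLine("Failed after {0} ms: {1}", timer.ElapsedMilliseconds, ex.Message);
        }
    }
}
```

If this finishes (or fails) in well under the two minutes you are seeing, the DTD download is the likely bottleneck.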
Given the inner workings of the Wikipedia parser (a right mess), I'd say it's a big leap to assume it's going to produce well-formed XHTML every time.
Use HTML Agility Pack to parse (then you can convert to XmlDocument a little more easily if required, IIRC).
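Something along these lines should be roughly equivalent to the loop in the question. It is a minimal sketch, assuming the HtmlAgilityPack NuGet package is referenced and the page URL is passed as the first argument; the class name is hypothetical. Note that it prints InnerText, the full text of each paragraph, rather than the value of the first child node only.

```csharp
using System;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

class ParagraphDump
{
    static void Main(string[] args)
    {
        if (args.Length < 1)
        {
            Console.WriteLine("usage: ParagraphDump <url>");
            return;
        }

        // HtmlWeb downloads the page and parses it leniently; no DTD is involved.
        var web = new HtmlWeb();
        HtmlDocument doc = web.Load(args[0]);

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return;

        foreach (HtmlNode node in paragraphs)
            Console.WriteLine(node.InnerText);
    }
}
```

Because the parser tolerates malformed markup, this also sidesteps the well-formedness concern above.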
If you really want to go down the XmlDocument route you can keep a local cache of the HTML DTDs. See this post, this post and this post for details.
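One way to do the local cache is a custom resolver. This is only a sketch under a few assumptions: LocalDtdResolver is a hypothetical class, the cache folder (second argument) already holds copies of the DTD and the entity files it pulls in, saved under their original file names, and matching on file name alone is good enough for your documents.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical resolver: serves DTD and entity files from a local folder
// and falls back to the normal lookup for anything not cached there.
class LocalDtdResolver : XmlUrlResolver
{
    private readonly string cacheDir;

    public LocalDtdResolver(string cacheDir)
    {
        this.cacheDir = cacheDir;
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Match on the file name only, e.g. "xhtml1-transitional.dtd" or "xhtml-lat1.ent".
        string fileName = Path.GetFileName(absoluteUri.AbsolutePath);
        string ext = Path.GetExtension(fileName).ToLowerInvariant();
        if (ext == ".dtd" || ext == ".ent")
        {
            string cached = Path.Combine(cacheDir, fileName);
            if (File.Exists(cached))
                return File.OpenRead(cached);
        }
        // The document itself, and anything not cached, goes through the default path.
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

class Program
{
    static void Main(string[] args)
    {
        if (args.Length < 2)
        {
            Console.WriteLine("usage: Program <url-or-path> <dtd-cache-folder>");
            return;
        }

        var doc = new XmlDocument();
        // Set the resolver before Load so the DTD lookup goes through the cache.
        doc.XmlResolver = new LocalDtdResolver(args[1]);
        doc.Load(args[0]);
        Console.WriteLine("loaded");
    }
}
```

The entities stay defined because the DTD is still parsed; only the download is skipped.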