
Why does this XML file load slowly?

I have some very simple code:

        XmlDocument doc = new XmlDocument();
        Console.WriteLine("loading");
        doc.Load(url);
        Console.WriteLine("loaded");

        XmlNodeList nodeList = doc.GetElementsByTagName("p");

        foreach(XmlNode node in nodeList)
        {
            Console.WriteLine(node.ChildNodes[0].Value);
        }
        return source;

I'm working on this file and it takes two minutes to load. Why does it take so long? I tried both fetching the file from the net and loading a local copy.

John asked Dec 09 '22 09:12


1 Answer

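I imagine it's the DTD referenced by the page's DOCTYPE that's taking so long to load: the parser fetches it before it parses the document, which would also explain why a local copy of the file is just as slow. Because that DTD defines entities the page may rely on, you shouldn't simply disable DTD processing, so you're probably better off not going down this path.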

Given the inner workings of the Wikipedia parser (a right mess), I'd say it's a big leap to assume it's going to produce well-formed XHTML every time.

Use HTML Agility Pack to parse (then you can convert to XmlDocument a little more easily if required, IIRC).
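
As a rough illustration of that suggestion, here is a minimal sketch that pulls the `p` elements out of markup that is not well-formed XML. It assumes the HtmlAgilityPack NuGet package is referenced; the inline `html` string and the class name `AgilityPackSketch` are made up for the example and stand in for the real page.

```csharp
using System;
using HtmlAgilityPack; // assumption: the "HtmlAgilityPack" NuGet package is installed

class AgilityPackSketch
{
    static void Main()
    {
        // Hypothetical input, deliberately not well-formed XML:
        // unclosed <p> and <br> would make XmlDocument throw.
        string html = "<html><body><p>first<br>line<p>second</body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // lenient HTML parsing, no DTD is fetched

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
        {
            return;
        }

        foreach (HtmlNode p in paragraphs)
        {
            Console.WriteLine(p.InnerText);
        }
    }
}
```

For a live page, `new HtmlWeb().Load(url)` returns the same kind of `HtmlDocument`.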

If you really want to go down the XmlDocument route you can keep a local cache of the HTML DTDs. See this post, this post and this post for details.
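
A minimal sketch of the local-cache idea, assuming the DTD and entity files the page references have already been saved by hand into a folder (called `dtd-cache` here, a made-up name). `LocalDtdResolver` is a hypothetical helper, not something from the linked posts: it serves `.dtd` and `.ent` requests from that folder and falls back to the normal download for everything else.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical resolver: answers DTD/entity requests from a local folder
// instead of downloading them on every load.
class LocalDtdResolver : XmlUrlResolver
{
    private readonly string _cacheDir;

    public LocalDtdResolver(string cacheDir)
    {
        _cacheDir = cacheDir;
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Map a request such as .../DTD/xhtml1-transitional.dtd
        // to <cacheDir>/xhtml1-transitional.dtd.
        string fileName = Path.GetFileName(absoluteUri.AbsolutePath);
        bool isDtdResource =
            fileName.EndsWith(".dtd", StringComparison.OrdinalIgnoreCase) ||
            fileName.EndsWith(".ent", StringComparison.OrdinalIgnoreCase);

        if (isDtdResource)
        {
            string localPath = Path.Combine(_cacheDir, fileName);
            if (File.Exists(localPath))
            {
                return File.OpenRead(localPath);
            }
        }

        // Anything not cached (including the document itself) is resolved as usual.
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}

class CachedDtdSketch
{
    static void Main(string[] args)
    {
        string url = args[0]; // the document to load, as in the question

        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Parse, // keep the entity definitions
            XmlResolver = new LocalDtdResolver("dtd-cache")
        };

        var doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(url, settings))
        {
            doc.Load(reader);
        }

        Console.WriteLine("loaded");
    }
}
```

Matching on the file name alone keeps the sketch short; it would need a stricter mapping if two different DTDs shared a file name.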

spender answered Dec 25 '22 07:12