
How to read a large XML file with XElement without loading it all into memory

I want to read a large XML file (100+ MB). Because of its size, I do not want to load the whole document into memory as an XElement. I am using LINQ to XML queries to parse and read it.

What's the best way to do this? Is there an example that combines XPath or XmlReader with LINQ to XML/XElement?

Please help. Thanks.

hIpPy asked Feb 12 '10 05:02


3 Answers

Yes, you can combine XmlReader with the XNode.ReadFrom method. See the C# example in the documentation, which uses the XmlReader to locate the nodes of interest and materializes each one as an XElement, so only one element is held in memory at a time.

Martin Honnen answered Sep 18 '22 06:09


The example code in the MSDN documentation for the XNode.ReadFrom method is as follows:

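using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

class Program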
{
    static IEnumerable<XElement> StreamRootChildDoc(string uri)
    {
        using (XmlReader reader = XmlReader.Create(uri))
        {
            reader.MoveToContent();
            // Parse the file and display each of the nodes.
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        if (reader.Name == "Child")
                        {
                            XElement el = XElement.ReadFrom(reader) as XElement;
                            if (el != null)
                                yield return el;
                        }
                        break;
                }
            }
        }
    }

    static void Main(string[] args)
    {
        IEnumerable<string> grandChildData =
            from el in StreamRootChildDoc("Source.xml")
            where (int)el.Attribute("Key") > 1
            select (string)el.Element("GrandChild");

        foreach (string str in grandChildData)
            Console.WriteLine(str);
    }
}

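But I've found that the StreamRootChildDoc method in the example needs to be modified. XNode.ReadFrom leaves the reader positioned on the node that follows the element it just read, so the reader.Read() call in the while condition then steps past that node. When two Child elements are adjacent with no whitespace between them, every second one is skipped. The loop should call Read() only when the current node was not consumed by ReadFrom: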

    static IEnumerable<XElement> StreamRootChildDoc(string uri)
    {
        using (XmlReader reader = XmlReader.Create(uri))
        {
            reader.MoveToContent();
            // ReadFrom already advances the reader past the element it reads,
            // so call Read() only when the current node is not a matching element.
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    XElement el = XElement.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }
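
To see the difference between the two loops, here is a minimal sketch that runs both over the same in-memory document. The class name StreamingDemo, the method names ReadAlways and ReadWhenNeeded, and the sample XML are made up for illustration; the sample deliberately has no whitespace between the Child elements, which is the case where the documentation version loses nodes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

public static class StreamingDemo
{
    // Loop shape from the documentation example: Read() runs on every iteration,
    // even right after ReadFrom has already moved the reader forward.
    public static IEnumerable<XElement> ReadAlways(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
            }
        }
    }

    // Modified loop: Read() runs only when ReadFrom did not consume the current node.
    public static IEnumerable<XElement> ReadWhenNeeded(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.MoveToContent();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Child")
                {
                    XElement el = XNode.ReadFrom(reader) as XElement;
                    if (el != null)
                        yield return el;
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    public static void Main()
    {
        // Hypothetical sample: four adjacent Child elements, no whitespace between them.
        string xml = "<Root><Child Key=\"1\"/><Child Key=\"2\"/>" +
                     "<Child Key=\"3\"/><Child Key=\"4\"/></Root>";

        Console.WriteLine(ReadAlways(xml).Count());     // 2 (Key 1 and Key 3 only)
        Console.WriteLine(ReadWhenNeeded(xml).Count()); // 4
    }
}
```

With indented input the documentation version often appears to work, because the node it skips after each Child is usually a whitespace node rather than the next element.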

Kenny Evitt answered Sep 18 '22 06:09


Just keep in mind that you will have to read the file sequentially, so referring to siblings or descendants is going to be slow at best and impossible at worst. Otherwise @MartinHonnen has the key.

No Refunds No Returns answered Sep 22 '22 06:09