How best to use XPath with very large XML files in .NET?

Tags:

c#

.net

xml

large-files

xpath

I need to do some processing on fairly large XML files (large here being potentially upwards of a gigabyte) in C#, including performing some complex XPath queries. The problem I have is that the standard way I would normally do this, through the System.Xml libraries, likes to load the whole file into memory before it does anything with it, which can cause memory problems with files of this size.

I don't need to update the files at all, just read them and query the data they contain. Some of the XPath queries are quite involved and span several levels of parent-child relationships. I'm not sure whether this will affect the ability to use a stream reader rather than loading the data into memory as a block.

One way I can see of making it work is to perform the simple analysis using a stream-based approach and perhaps wrapping the XPath statements into XSLT transformations that I could run across the files afterward, although it seems a little convoluted.

Alternatively, I know that there are some elements that the XPath queries will not run across, so I guess I could break the document up into a series of smaller fragments based on its original tree structure, which could perhaps be small enough to process in memory without causing too much havoc.

I've tried to explain my objective here, so if I'm barking up totally the wrong tree in terms of general approach, I'm sure you folks can set me right...

291

asked Jan 02 '09 16:01

glenatron

2 Answers

XPathReader is the answer. It isn't part of the .NET Framework, but it is available for download from Microsoft. Here is an MSDN article.

If you construct an XPathReader with an XmlTextReader, you get the efficiency of a streaming read with the convenience of XPath expressions.

I haven't used it on gigabyte-sized files, but I have used it on files that are tens of megabytes, which is usually enough to slow down DOM-based solutions.

Quoting from the below: "The XPathReader provides the ability to perform XPath over XML documents in a streaming manner".

Download from Microsoft

116

answered Sep 19 '22 19:09

Richard Wolf

Gigabyte XML files! I don't envy you this task.

Is there any way that the files could be sent in a better way? For example, are they being sent over the net to you? If they are, then a more efficient format might be better for all concerned. Reading the file into a database isn't a bad idea, but it could be very time-consuming indeed.

I wouldn't try to do it all in memory by reading the entire file, unless you have a 64-bit OS and lots of memory. What if the file becomes 2, 3, or 4 GB?

One other approach could be to parse the XML file with SAX and write out smaller XML files according to some logical split. You could then process these with XPath. I've used XPath on 20-30 MB files and it is very quick. I was originally going to use SAX, but thought I would give XPath a go and was surprised how quick it was. I saved a lot of development time and probably only lost 250 ms per query. I was using Java for my parsing, but I suspect there would be little difference in .NET.
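A minimal sketch of that split-then-query idea in C#, under some loudly stated assumptions: .NET has no built-in SAX parser, so this swaps in the pull-based `XmlReader` for the streaming pass, and instead of writing smaller files to disk it materializes one fragment at a time with `XNode.ReadFrom` and runs the XPath against that fragment only. The class name `FragmentXPath` and the element names used in the usage note are hypothetical, and the approach only works if no query needs to reach outside a fragment.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical helper, not part of any library.
public static class FragmentXPath
{
    // Streams the document and yields each element named fragmentName as a
    // small in-memory XElement, so only one fragment is held at a time.
    public static IEnumerable<XElement> StreamFragments(XmlReader reader, string fragmentName)
    {
        reader.MoveToContent();
        while (!reader.EOF)
        {
            if (reader.NodeType == XmlNodeType.Element && reader.LocalName == fragmentName)
            {
                // ReadFrom consumes the whole subtree and leaves the reader on
                // the node after it, so Read() must not be called in this branch.
                yield return (XElement)XNode.ReadFrom(reader);
            }
            else
            {
                reader.Read();
            }
        }
    }

    // Evaluates an XPath expression relative to each fragment and collects
    // the text values of the matching elements.
    public static List<string> Query(XmlReader reader, string fragmentName, string xpath)
    {
        var results = new List<string>();
        foreach (XElement fragment in StreamFragments(reader, fragmentName))
        {
            foreach (XElement hit in fragment.XPathSelectElements(xpath))
            {
                results.Add(hit.Value);
            }
        }
        return results;
    }
}
```

With a reader from `XmlReader.Create(path)`, a call such as `FragmentXPath.Query(reader, "order", "item[price > 100]/name")` would evaluate the expression once per `order` element. Memory use is then bounded by the largest single fragment rather than by the whole file, at the cost of not being able to query across fragments.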

I did read that XML::Twig (a Perl CPAN module) was written explicitly to handle SAX-based XPath parsing. Can you use a different language?

This might also help https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1044772.html

answered Sep 19 '22 19:09

Fortyrunner
