I have a very large XML file, 2.8 GB: the Polish Wikipedia articles dump. The size of this file is a real problem for me. The task is to search it for a large number of articles, and all I have are the articles' titles. My first thought was to sort the titles and make a single linear pass through the file. The idea isn't bad, but the articles aren't sorted alphabetically; they're sorted by ID, which I don't know a priori.
So my second thought was to build an index of the file: store, in another file (or a database), lines in the following format: title;id;offset
(maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the byte offset of the needed tag, I could use a simple Seek
call to move the cursor within the file without reading all the content in between. For smaller files I think this would work fine, but on my computer (laptop, Core 2 Duo, Win7, VS2008) the application stops responding.
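The seek idea itself is sound: once a byte offset is known, jumping to it is an O(1) operation regardless of how large the file is. A minimal sketch of that lookup step (the path, offset, and read length are placeholders, not values from the question):

```csharp
using System.IO;
using System.Text;

public static class SeekDemo
{
    // Jump straight to a stored byte offset and read `length` bytes from there,
    // without scanning the gigabytes that precede it.
    public static string ReadFrom(string path, long byteOffset, int length)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(byteOffset, SeekOrigin.Begin); // constant-time jump, no reading
            var buffer = new byte[length];
            int read = fs.Read(buffer, 0, length);
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }
}
```

Note that the stored offset must be a byte offset, not a character offset; with UTF-8 text (Polish diacritics included) the two differ.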
In my program, I read the file line by line and check whether each line contains a tag I need. I also count all the bytes I have read and save index lines in the format mentioned above. So the program hangs while indexing; by that point, though, the resulting index file is 36.2 MB and the last recorded offset is around 2,872,765,202 bytes, while the whole XML file is 3,085,439,630 bytes long.
My third thought was to split the file into smaller pieces: to be precise, into 26 pieces (one per letter of the Latin alphabet), each containing only the entries whose titles start with the same letter. For example, a.xml would hold all entries whose titles start with "A". The final files would be tens of MB, at most around 200 MB I think. But reading the whole file poses the same problem.
To read the file I used what is probably the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. Obviously I can't load the whole file into memory: I have only 3 GB of RAM installed, and Win7 takes 800 MB to 1 GB of it when fully loaded.
So I'm asking for help: what is the best approach? The point is that searching this XML file has to be fast, faster than downloading the corresponding Wikipedia pages in HTML format. I'm not even sure that's possible.
Maybe I should load all the needed content into a database? Maybe that would be faster? But I will still need to read the whole file at least once.
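Whichever way the index is built, it is small enough (36.2 MB per the question) to fit comfortably in RAM, so a full database may be overkill for lookups. A sketch of loading a title;id;byteOffset index file into a dictionary (the file format is taken from the question; File.ReadAllLines is used because it exists on the .NET 2.0/3.5 era the question targets):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TitleIndex
{
    // Load "title;id;byteOffset" lines into a map: title -> byte offset.
    // After this one-time load, each title lookup is O(1), and fetching the
    // article costs a single Seek + read on the big XML file.
    public static Dictionary<string, long> Load(string indexPath)
    {
        var map = new Dictionary<string, long>(StringComparer.Ordinal);
        foreach (var line in File.ReadAllLines(indexPath))
        {
            var parts = line.Split(';');
            if (parts.Length < 3) continue; // skip malformed lines
            long offset;
            // take the last field, so a ';' inside the title does not shift the offset
            if (long.TryParse(parts[parts.Length - 1], out offset))
                map[parts[0]] = offset;
        }
        return map;
    }
}
```

One caveat of the semicolon-separated format: a title containing ';' still corrupts the title field itself, so a separator that cannot occur in titles (e.g. a tab) would be safer.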
I'm not sure whether there is a limit on question length, but here is a sample of my indexing code.
while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        fileIndex += enc.GetByteCount(line) + 2; // +2 covers the \r\n that ReadLine strips from the line
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }
    if (currentArea == Area.nothing) // nothing interesting at the moment
    {
        // search for the position of the <title> tag
        position = MoveAfter("<title>", line, position); // searches until it finds the <title> tag
        if (position >= 0) currentArea = Area.title;
        else continue;
    }
    (...)
    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            long index = fileIndex;
            // subtract bytes, not characters: line.Length undercounts multi-byte UTF-8,
            // and the \r\n must be subtracted too to land on the start of the line
            index -= enc.GetByteCount(line) + 2;
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}
reader.Close();
reader.Dispose();
writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}
Best Regards and Thanks in advance,
ventus
Edit: Link to this Wiki's dump http://download.wikimedia.org/plwiki/20100629/
Oftentimes, the large size of XML structures is due to the fact that they are an XML representation of a database dump. There might be redundant or even useless information that you could discard with an XSLT transformation.
XML itself imposes no limit on file size, but DOM-style parsing consumes memory proportional to the size of the document, so parsing a multi-gigabyte file that way is a performance hit. It is advisable to use a streaming, SAX-style parser for .NET when processing long XML documents.
Well... If you're going to search it, I would highly recommend you find a better way than dealing with the file itself. As you mention, I suggest putting it into a well-normalized, indexed database and doing your searching there. Anything else you do will effectively duplicate exactly what a database does.
Doing so will take time, however. XmlTextReader is probably your best bet; it works one node at a time. LINQ to XML should also be fairly efficient, but I haven't tried it with a large file and so can't comment.
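To illustrate the streaming idea: XmlReader (the factory behind XmlTextReader in later .NET versions) processes one node at a time, so memory stays roughly constant no matter how big the file is. A sketch that pulls every article title out of the dump; the element name "title" matches the MediaWiki export schema, the rest is illustrative:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DumpScanner
{
    // Stream the dump and yield each <title> element's text content.
    // Constant memory: only the current node is held at any time.
    public static IEnumerable<string> Titles(Stream xml)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (var reader = XmlReader.Create(xml, settings))
        {
            while (reader.ReadToFollowing("title"))
                yield return reader.ReadElementContentAsString();
        }
    }
}
```

The same loop could feed the database load, or emit the title-to-offset index file, in a single pass over the 2.8 GB dump.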
May I ask: where did this huge XML file come from? Perhaps there's a way to deal with the situation at the source, rather than having to process a 3 GB file.