Working with a very large XML file in C#

Tags: c#, xml

I have a huge XML file, 2.8 GB in size: a dump of Polish Wikipedia's articles. The size of this file is very problematic for me. The task is to search this file for a large amount of data. All I have are the titles of the articles. I thought I could sort those titles and make a single linear pass through the file. The idea is not bad, but the articles are not sorted alphabetically. They are sorted by ID, which I don't know a priori.

So my second thought was to build an index of that file: to store, in another file (or a database), lines in the following format: title;id;index (maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the index of the tag I need, I could use a simple Seek call to move the cursor within the file without reading all the content. For smaller files I think this could work fine. But on my computer (laptop, Core 2 Duo, Win7, VS2008) I get an error that the application is not responding.
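To make the Seek idea concrete, here is a minimal sketch of reading a slice of the file from a byte offset taken from such an index. The class name `IndexSeekSketch`, the method `ReadAt`, and the tiny temporary file are made up for illustration, and the sketch assumes the file is UTF-8 and the offset points at the start of a character (for example the `<` of a tag).

```csharp
using System;
using System.IO;
using System.Text;

static class IndexSeekSketch
{
    // Reads up to maxBytes starting at the given byte offset.
    // Assumption: the file is UTF-8 and offset falls on a character boundary.
    public static string ReadAt(string path, long offset, int maxBytes)
    {
        using (FileStream fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            // Seek moves the file position directly; the preceding bytes are not read.
            fs.Seek(offset, SeekOrigin.Begin);

            byte[] buffer = new byte[maxBytes];
            int total = 0;
            int read;
            // Read may return fewer bytes than requested, so loop until done or EOF.
            while (total < maxBytes && (read = fs.Read(buffer, total, maxBytes - total)) > 0)
            {
                total += read;
            }
            return Encoding.UTF8.GetString(buffer, 0, total);
        }
    }

    static void Main()
    {
        // Hypothetical miniature "dump" written without a byte order mark.
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "<page>\n<title>Żubr</title>\n</page>\n", new UTF8Encoding(false));

        // "<page>\n" occupies 7 bytes, so the <title> line starts at offset 7.
        Console.WriteLine(ReadAt(path, 7, 20));

        File.Delete(path);
    }
}
```

If the slice ends in the middle of a multi-byte character, the decoder substitutes a replacement character, so a real lookup would more likely read forward until it reaches a closing tag than use a fixed byte count.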

In my program, I read each line from the file and check whether it contains a tag that I need. I also count all the bytes I read and save lines in the format mentioned above. While indexing, the program hangs. By that point the resulting index file is 36.2 MB and the last index is around 2,872,765,202 B, while the whole XML file is 3,085,439,630 B long.

My third thought was to split the file into smaller pieces. To be precise, into 26 pieces (there are 26 letters in the Latin alphabet), each containing only the entries that start with the same letter, e.g. a.xml would hold all entries whose titles start with the letter "A". The final files would be tens of MB each, at most around 200 MB I think. But there's the same problem: I still have to read the whole file.
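The per-letter split could be sketched as a function that maps a title to its target file name. The names `LetterBuckets` and `BucketFor`, and the catch-all `other.xml`, are assumptions for illustration. The catch-all is there because 26 buckets do not cover every title: some start with a digit, a symbol, or a Polish letter such as "Ł" or "Ż".

```csharp
using System;

static class LetterBuckets
{
    // Maps a title to "a.xml" .. "z.xml" by its first character,
    // or to "other.xml" when the title does not start with a basic Latin letter.
    public static string BucketFor(string title)
    {
        if (!string.IsNullOrEmpty(title))
        {
            char first = char.ToLowerInvariant(title[0]);
            if (first >= 'a' && first <= 'z')
            {
                return first + ".xml";
            }
        }
        return "other.xml";
    }

    static void Main()
    {
        // Sample titles chosen only to exercise each branch.
        string[] samples = { "Adam Mickiewicz", "warszawa", "Żubr", "2010" };
        foreach (string title in samples)
        {
            Console.WriteLine(title + " -> " + BucketFor(title));
        }
    }
}
```

Each article would then be appended to the writer for its bucket while the dump is read once from start to end.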

To read the file I used what is probably the fastest way: StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. Obviously I can't load the whole file into memory. I have only 3 GB of RAM installed, and Win7 takes about 800 MB to 1 GB when fully loaded.

So I'm asking for help. What is the best thing to do? The point is that searching this XML file has to be fast. It has to be faster than downloading single Wikipedia pages in HTML format. I'm not even sure that is possible.

Maybe load all the needed content into a database? Maybe that would be faster? But I would still need to read the whole file at least once.

I'm not sure if there is a limit on the length of a single question, but I will also put a sample of my indexing source code here.

while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        fileIndex += enc.GetByteCount(line) + 2; // +2 covers the "\r\n" that ReadLine() strips; this assumes CRLF line endings (use +1 if the file has "\n"-only endings)
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }

    if (currentArea == Area.nothing)    //nothing interesting at the moment
    {
         //search for position of <title> tag
         position = MoveAfter("<title>", line, position);    //searches until it finds <title> tag
         if (position >= 0) currentArea = Area.title;
         else continue;
    }

    (...)

    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            // fileIndex already points past this line and its "\r\n", so step back
            // by the line's byte count (not line.Length, which counts chars) plus 2
            // to get the byte offset where this line starts.
            long index = fileIndex - (enc.GetByteCount(line) + 2);
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}

reader.Close();    // Close() also disposes the reader
writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index);
}

Best Regards and Thanks in advance,

ventus

Edit: Link to this Wiki's dump http://download.wikimedia.org/plwiki/20100629/

Ventus, asked Jul 26 '10 18:07



1 Answer

Well... If you're going to search it, I would highly recommend you find a better way than dealing with the file itself. I suggest, as you mention, putting it into a well-normalized and indexed database and doing your searching there. Anything else you do will effectively duplicate exactly what a database does.

Doing so will take time, however. XmlTextReader is probably your best bet; it works one node at a time. LINQ to XML should also be fairly efficient, but I haven't tried it with a large file and so can't comment.
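As a rough illustration of node-at-a-time reading, the sketch below streams a document and yields the text of each `<title>` element without building the whole tree in memory. It uses `XmlReader.Create`, the factory for the same forward-only reader family as XmlTextReader. The class name `DumpTitles`, the method `ReadTitles`, and the two-page sample document are assumptions for illustration; matching on `LocalName` is meant to cope with the dump's default namespace.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class DumpTitles
{
    // Lazily yields the text content of every <title> element in the input.
    public static IEnumerable<string> ReadTitles(TextReader input)
    {
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.LocalName == "title")
                {
                    // ReadElementContentAsString leaves the reader on the node after
                    // </title>, so Read() must not be called again in this branch.
                    yield return reader.ReadElementContentAsString();
                }
                else
                {
                    reader.Read();
                }
            }
        }
    }

    static void Main()
    {
        // Hypothetical two-page document shaped loosely like a MediaWiki export.
        string xml =
            "<mediawiki xmlns=\"http://example.org/export\">" +
            "<page><title>Alfa</title><id>1</id></page>" +
            "<page><title>Beta</title><id>2</id></page>" +
            "</mediawiki>";

        foreach (string title in ReadTitles(new StringReader(xml)))
        {
            Console.WriteLine(title);
        }
    }
}
```

For the real dump, the `StringReader` would be replaced by a `StreamReader` over the file, and each title (with its ID and text) could be inserted into the database as it is read.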

May I ask: where did this huge XML file come from? Perhaps there's a way to deal with the situation at the source, rather than having to process a 3 GB file.

Randolpho, answered Nov 11 '22 19:11