I have a very large XML file, 2.8 GB: the Polish Wikipedia articles dump. The size of this file is a real problem for me. The task is to search it for a large number of articles, and all I have are the articles' titles. My first thought was to sort the titles and make a single linear pass through the file. The idea isn't bad, but the articles aren't sorted alphabetically; they're sorted by ID, which I don't know a priori.
So my second thought was to build an index of the file: store, in another file (or a database), lines in the following format: title;id;offset
(maybe without the ID). In my other question I asked for help with that. The hypothesis was that if I had the byte offset of the needed tag, I could use a simple Seek
call to move the cursor within the file without reading all the content in between. For smaller files I think this would work fine, but on my computer (laptop, Core 2 Duo, Win7, VS2008) the application stops responding.
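The seek idea itself is sound: once a byte offset is known, jumping to it is an O(1) operation regardless of how large the file is. A minimal sketch of that lookup step (the path, offset, and read length are placeholders, not values from the question):

```csharp
using System.IO;
using System.Text;

public static class SeekDemo
{
    // Jump straight to a stored byte offset and read `length` bytes from there,
    // without scanning the gigabytes that precede it.
    public static string ReadFrom(string path, long byteOffset, int length)
    {
        using (var fs = new FileStream(path, FileMode.Open, FileAccess.Read))
        {
            fs.Seek(byteOffset, SeekOrigin.Begin); // constant-time jump, no reading
            var buffer = new byte[length];
            int read = fs.Read(buffer, 0, length);
            return Encoding.UTF8.GetString(buffer, 0, read);
        }
    }
}
```

Note that the stored offset must be a byte offset, not a character offset; with UTF-8 text (Polish diacritics included) the two differ.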
In my program, I read the file line by line and check whether each line contains a tag I need. I also count all the bytes I have read and save index lines in the format mentioned above. So the program hangs while indexing; by that point, though, the resulting index file is 36.2 MB and the last recorded offset is around 2,872,765,202 bytes, while the whole XML file is 3,085,439,630 bytes long.
My third thought was to split the file into smaller pieces: to be precise, into 26 pieces (one per letter of the Latin alphabet), each containing only the entries whose titles start with the same letter. For example, a.xml would hold all entries whose titles start with "A". The final files would be tens of MB, at most around 200 MB I think. But reading the whole file poses the same problem.
To read the file I used what is probably the fastest way: a StreamReader. I read somewhere that StreamReader and the XmlReader class from System.Xml are the fastest methods, with StreamReader even faster than XmlReader. Obviously I can't load the whole file into memory: I have only 3 GB of RAM installed, and Win7 takes 800 MB to 1 GB of it when fully loaded.
So I'm asking for help: what is the best approach? The point is that searching this XML file has to be fast, faster than downloading the corresponding Wikipedia pages in HTML format. I'm not even sure that's possible.
Maybe I should load all the needed content into a database? Maybe that would be faster? But I will still need to read the whole file at least once.
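Whichever way the index is built, it is small enough (36.2 MB per the question) to fit comfortably in RAM, so a full database may be overkill for lookups. A sketch of loading a title;id;byteOffset index file into a dictionary (the file format is taken from the question; File.ReadAllLines is used because it exists on the .NET 2.0/3.5 era the question targets):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TitleIndex
{
    // Load "title;id;byteOffset" lines into a map: title -> byte offset.
    // After this one-time load, each title lookup is O(1), and fetching the
    // article costs a single Seek + read on the big XML file.
    public static Dictionary<string, long> Load(string indexPath)
    {
        var map = new Dictionary<string, long>(StringComparer.Ordinal);
        foreach (var line in File.ReadAllLines(indexPath))
        {
            var parts = line.Split(';');
            if (parts.Length < 3) continue; // skip malformed lines
            long offset;
            // take the last field, so a ';' inside the title does not shift the offset
            if (long.TryParse(parts[parts.Length - 1], out offset))
                map[parts[0]] = offset;
        }
        return map;
    }
}
```

One caveat of the semicolon-separated format: a title containing ';' still corrupts the title field itself, so a separator that cannot occur in titles (e.g. a tab) would be safer.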
I'm not sure whether there is a limit on question length, but here is a sample of my indexing code.
while (reading)
{
    if (!reader.EndOfStream)
    {
        line = reader.ReadLine();
        fileIndex += enc.GetByteCount(line) + 2; // +2 covers the \r\n that ReadLine strips from the line
        position = 0;
    }
    else
    {
        reading = false;
        continue;
    }
    if (currentArea == Area.nothing) // nothing interesting at the moment
    {
        // search for the position of the <title> tag
        position = MoveAfter("<title>", line, position); // searches until it finds the <title> tag
        if (position >= 0) currentArea = Area.title;
        else continue;
    }
    (...)
    if (currentArea == Area.text)
    {
        position = MoveAfter("<text", line, position);
        if (position >= 0)
        {
            long index = fileIndex;
            // subtract bytes, not characters: line.Length undercounts multi-byte UTF-8,
            // and the \r\n must be subtracted too to land on the start of the line
            index -= enc.GetByteCount(line) + 2;
            WriteIndex(currentTitle, currentId, index);
            currentArea = Area.nothing;
        }
        else continue;
    }
}
reader.Close();
reader.Dispose();
writer.Close();
}

private void WriteIndex(string title, string id, long index)
{
    writer.WriteLine(title + ";" + id + ";" + index.ToString());
}
Best Regards and Thanks in advance,
ventus
Edit: Link to this Wiki's dump http://download.wikimedia.org/plwiki/20100629/
Oftentimes, the large size of XML structures is due to the fact that they are an XML representation of a database dump. There might be redundant or even useless information that you could discard with an XSLT transformation.
XML itself imposes no limit on file size, but DOM-style parsing consumes memory proportional to the size of the document, so parsing a multi-gigabyte file that way is a performance hit. It is advisable to use a streaming, SAX-style parser for .NET when processing long XML documents.
Well... If you're going to search it, I would highly recommend you find a better way than dealing with the file itself. As you mention, I suggest putting it into a well-normalized, indexed database and doing your searching there. Anything else you do will effectively duplicate exactly what a database does.
Doing so will take time, however. XmlTextReader is probably your best bet; it works one node at a time. LINQ to XML should also be fairly efficient, but I haven't tried it with a large file and so can't comment.
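To illustrate the streaming idea: XmlReader (the factory behind XmlTextReader in later .NET versions) processes one node at a time, so memory stays roughly constant no matter how big the file is. A sketch that pulls every article title out of the dump; the element name "title" matches the MediaWiki export schema, the rest is illustrative:

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DumpScanner
{
    // Stream the dump and yield each <title> element's text content.
    // Constant memory: only the current node is held at any time.
    public static IEnumerable<string> Titles(Stream xml)
    {
        var settings = new XmlReaderSettings { IgnoreWhitespace = true };
        using (var reader = XmlReader.Create(xml, settings))
        {
            while (reader.ReadToFollowing("title"))
                yield return reader.ReadElementContentAsString();
        }
    }
}
```

The same loop could feed the database load, or emit the title-to-offset index file, in a single pass over the 2.8 GB dump.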
May I ask: where did this huge XML file come from? Perhaps there's a way to deal with the situation at the source, rather than having to process a 3 GB file.