
How to get the line number in an XML file when it exceeds int.MaxValue

Tags:

c#

.net

xml

I am unable to get the line number in an XML file that is nearly 300 GB. IXmlLineInfo.LineNumber is an Int32, and once it exceeds int.MaxValue a negative number is returned. It makes no difference whether I store the line number in an int or a long; I tried both. XmlReader is able to read to EOF. I am using .NET 2.0, and the newest version also uses an Int32.

public void ReadLines()
{
    long readcounter = 0;
    long linenumber = 0;
    string fname = "I:\\XML Files\\europe-latest.osm";
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ProhibitDtd = false;
    settings.XmlResolver = null;
    XmlReader reader = XmlReader.Create(fname, settings);

    IXmlLineInfo lineInfo = (IXmlLineInfo)reader;
    try
    {
        while (reader.Read())
        {
            // LineNumber is an Int32; storing it in a long does not undo the wrap-around
            linenumber = lineInfo.LineNumber;
            readcounter++;
            // Report progress every million nodes
            if (readcounter % 1000000 == 0) Console.WriteLine(linenumber.ToString());
        }
    }
    catch (XmlException ex)
    {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
    }
    finally
    {
        reader.Close();
        Console.WriteLine(DateTime.Now.ToLongTimeString());
    }
}
user204427 asked Jun 30 '14 18:06



2 Answers

There are only a few things you can try:

1) Use System.Numerics.BigInteger to store the actual line number. After each read, check that the reported line number is not less than it was before, and accumulate the true line number in the BigInteger. Be aware that in a truly enormous file the counter can wrap all the way around within a single step and end up greater than it was before (for example, after reading an element that spans 5 billion lines), in which case the wrap goes undetected:

var actualLine = new System.Numerics.BigInteger(0);

Int32 lastInt32Line = lineInfo.LineNumber;

// Some Xml reading

Int32 currentInt32Line = lineInfo.LineNumber;

// Widen to long before subtracting so the difference itself cannot overflow
long diff = (long)currentInt32Line - lastInt32Line;

// If the counter went backwards, it wrapped around: add back the 2^32 that was lost
if (diff < 0)
    diff += 1L << 32;

actualLine += diff;
lastInt32Line = currentInt32Line;

The real risk is that even if you track the line number correctly yourself, the internals of the XmlReader may start to misbehave once its own counter overflows. In my opinion, checked integer arithmetic should be the default rather than unchecked, as it is now: when an overflow happens, the object's state is silently corrupted unless the code explicitly allowed for wrapping.

2) Reorganize your data storage so the data can be handled in smaller fragments.
3) Write your own XmlReader that uses BigInteger for its line numbers.
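
For option 1, here is a minimal sketch of the bookkeeping wrapped in a small helper. WideLineCounter is a hypothetical name, not part of .NET. It assumes the reader's counter wraps silently (which matches the negative values seen in the question) and that no single Read() advances by 2^32 lines or more. It uses a long instead of BigInteger, since BigInteger only arrived in .NET 4.0 and a long is far more than enough for a 300 GB file.

```csharp
using System;

// Hypothetical helper (not part of .NET): reconstructs a 64-bit line number
// from an Int32 counter that wraps around on overflow.
public sealed class WideLineCounter
{
    private int _last;     // last raw Int32 value reported by the reader
    private long _actual;  // reconstructed line number

    public WideLineCounter(int initialLine)
    {
        _last = initialLine;
        _actual = initialLine;
    }

    public long Actual
    {
        get { return _actual; }
    }

    // Call with the raw IXmlLineInfo.LineNumber after each reader.Read().
    public long Update(int rawLine)
    {
        // Unchecked subtraction wraps modulo 2^32. Reinterpreting the result
        // as uint gives the forward step, assuming the step is below 2^32.
        uint step = unchecked((uint)(rawLine - _last));
        _actual += step;
        _last = rawLine;
        return _actual;
    }
}
```

In the reading loop you would create the counter once with the initial lineInfo.LineNumber and replace the assignment with linenumber = counter.Update(lineInfo.LineNumber). This only corrects the number you report; it does nothing about whatever the reader does internally with its own overflowed counter.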

Eugene Podskal answered Nov 14 '22 22:11



After investigating a bit with dotPeek, it seems the problem is deeply rooted in the internal XmlTextReaderImpl class (this should be the actual type of the reader you are using) and the internal types it uses:

internal struct LineInfo
{
    internal int lineNo;
    internal int linePos;
    // ...
}

If you want to approach this with the least work, I recommend getting the .NET source code and creating your own XML reader by copying XmlTextReaderImpl (and all related internal types), replacing all the line number ints with BigIntegers. If you want to hide the concrete type, you could create an interface, IXmlBigLineInfo or similar, and use it instead of IXmlLineInfo.
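
As a rough sketch of what those two pieces might look like: the interface below mirrors the three members of IXmlLineInfo with BigInteger in place of int, and BigLineInfo is a hypothetical stand-in for the internal LineInfo struct. Both names and the NextLine helper are assumptions for illustration. The copied reader would still need every place that touches lineNo and linePos updated by hand, and BigInteger requires .NET 4.0 or later.

```csharp
using System.Numerics;

// Hypothetical counterpart to IXmlLineInfo with BigInteger positions.
public interface IXmlBigLineInfo
{
    bool HasLineInfo();
    BigInteger LineNumber { get; }
    BigInteger LinePosition { get; }
}

// Hypothetical replacement for the internal LineInfo struct.
public struct BigLineInfo
{
    public BigInteger LineNo;
    public BigInteger LinePos;

    public BigLineInfo(BigInteger lineNo, BigInteger linePos)
    {
        LineNo = lineNo;
        LinePos = linePos;
    }

    // Advance to the start of the next line; BigInteger cannot overflow here.
    public void NextLine()
    {
        LineNo += BigInteger.One;
        LinePos = BigInteger.One;
    }
}
```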

Hope this helps.

gwiazdorrr answered Nov 14 '22 22:11
