I am unable to get the line number in an XML file that is nearly 300 GB. IXmlLineInfo.LineNumber is an Int32, and once it exceeds int.MaxValue a negative number is returned. It makes no difference whether I store the line number in an int or a long; I tried both. XmlReader is able to read to EOF. I am using .NET 2.0, and the newest version also uses an Int32.
public void ReadLines()
{
    long readcounter = 0;
    long linenumber = 0;
    fname = "I:\\XML Files\\europe-latest.osm";
    XmlReaderSettings settings = new XmlReaderSettings();
    settings.ProhibitDtd = false;
    settings.XmlResolver = null;
    XmlReader reader = XmlReader.Create(fname, settings);
    IXmlLineInfo lineInfo = ((IXmlLineInfo)reader);
    try
    {
        while (reader.Read())
        {
            linenumber = lineInfo.LineNumber;
            readcounter++;
            if (readcounter % 1000000 == 0) Console.WriteLine(linenumber.ToString());
        }
    }
    catch (XmlException ex)
    {
        Console.WriteLine(ex.Message);
        Console.ReadLine();
    }
    finally
    {
        reader.Close();
        Console.WriteLine(DateTime.Now.ToLongTimeString());
    }
}
There isn't much you can try:
1) Use System.Numerics.BigInteger to store the actual line number. After each read, check whether the reported line number is lower than it was before; if it is, the Int32 counter has wrapped around. Note that in a truly enormous file the counter could in principle wrap and still end up greater than it was before (for example, if some 5 billion lines were consumed between two checks), which this check cannot detect:
// Requires: using System.Numerics;
int lastLine = lineInfo.LineNumber;
BigInteger actualLine = lastLine;
// Some Xml reading
int currentLine = lineInfo.LineNumber;
// Widen before subtracting so the difference itself cannot overflow
BigInteger diff = (BigInteger)currentLine - lastLine;
// If an overflow has happened, the counter wrapped by 2^32 - add it back
if (diff < 0)
    actualLine += diff + 4294967296L;
else // Everything is normal - add the diff
    actualLine += diff;
lastLine = currentLine;
The real potential problem is that, even if you track the line number correctly, the internals of XmlReader may start to misbehave once the counter overflows. In my opinion, checked integer arithmetic should be the default rather than unchecked, as it is now: when an overflow happens, the state of the class is silently corrupted unless the code explicitly asked for checking.
2) Reorganize your data storage to handle the data in a more fragmented manner.
3) Write your own XmlReader that uses BigInteger for line numbers.
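For option 1, a plain long is already wide enough, which matters on .NET 2.0 because System.Numerics.BigInteger only arrived in .NET Framework 4. Below is a minimal sketch, assuming the reader's Int32 counter simply wraps around as the question describes. WideLineCounter is a made-up helper, not part of System.Xml, and the values fed to it in Main are simulated rather than read from a real file.

```csharp
using System;

// Hypothetical helper (not part of System.Xml): accumulates a 64-bit line
// count from a 32-bit counter that silently wraps around.
public sealed class WideLineCounter
{
    private int last;
    public long Total { get; private set; }

    public WideLineCounter(int initial)
    {
        last = initial;
        Total = initial;
    }

    // Call after every successful reader.Read() with lineInfo.LineNumber.
    public long Update(int reported)
    {
        // Wrapping subtraction yields the forward distance modulo 2^32, so
        // reading it as uint stays correct across the overflow as long as
        // fewer than 2^32 lines pass between two calls.
        uint delta = unchecked((uint)(reported - last));
        Total += delta;
        last = reported;
        return Total;
    }
}

public static class Demo
{
    public static void Main()
    {
        var counter = new WideLineCounter(int.MaxValue - 1);
        // Simulated readings: the second value is what a wrapped Int32 reports.
        Console.WriteLine(counter.Update(int.MaxValue));     // prints 2147483647
        Console.WriteLine(counter.Update(int.MinValue + 1)); // prints 2147483649
    }
}
```

In the question's loop this would amount to `linenumber = counter.Update(lineInfo.LineNumber);` inside the `while (reader.Read())` body. It only repairs the number you report; it does nothing about whatever the reader does internally with its own overflowed counter.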
After investigating it a bit with dotPeek, it seems the problem is deeply rooted in the internal XmlTextReaderImpl class (this should be the actual type of the reader you are using) and the internal types it uses:
internal struct LineInfo
{
    internal int lineNo;
    internal int linePos;
    // ...
}
If you want to approach this with the minimal amount of work, I recommend you get the .NET source code and create your own XML reader by copying XmlTextReaderImpl (and all related internal types), replacing all the line number ints with BigIntegers. If you want to hide the type, you might want to create an interface, IXmlBigLineInfo or similar, and use it instead of IXmlLineInfo.
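To give an idea of what such an interface could look like, here is a rough sketch. It is not a copy of XmlTextReaderImpl: as a much smaller stand-in, it wraps an existing reader and widens the reported line number from the outside. IXmlBigLineInfo and BigLineInfoReader are hypothetical names, the members simply mirror those of IXmlLineInfo, and System.Numerics requires .NET Framework 4 or later.

```csharp
using System;
using System.IO;
using System.Numerics;
using System.Xml;

// Hypothetical interface: same shape as IXmlLineInfo, wider line number.
public interface IXmlBigLineInfo
{
    bool HasLineInfo();
    BigInteger LineNumber { get; }
    int LinePosition { get; }
}

// Hypothetical adapter around any XmlReader that implements IXmlLineInfo.
public sealed class BigLineInfoReader : IXmlBigLineInfo
{
    private readonly XmlReader reader;
    private readonly IXmlLineInfo info;
    private int last;
    private BigInteger line;

    public BigLineInfoReader(XmlReader reader)
    {
        this.reader = reader;
        info = reader as IXmlLineInfo;
        if (info == null)
            throw new ArgumentException("Reader exposes no line info.", "reader");
        last = info.LineNumber;
        line = last;
    }

    public bool Read()
    {
        bool more = reader.Read();
        if (more)
        {
            // Assumes the Int32 counter wraps silently; the wrapped
            // difference is the forward distance modulo 2^32.
            int current = info.LineNumber;
            line += unchecked((uint)(current - last));
            last = current;
        }
        return more;
    }

    public bool HasLineInfo() { return info.HasLineInfo(); }
    public BigInteger LineNumber { get { return line; } }
    public int LinePosition { get { return info.LinePosition; } }
}

public static class Demo
{
    public static void Main()
    {
        using (XmlReader xml = XmlReader.Create(new StringReader("<a>\n<b/>\n</a>")))
        {
            var wide = new BigLineInfoReader(xml);
            while (wide.Read()) { }
            Console.WriteLine(wide.LineNumber);
        }
    }
}
```

Calling code would then depend on IXmlBigLineInfo only, so the wrapper could later be swapped for a modified copy of the reader without touching the callers.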
Hope this helps.