I have a Windows desktop app written in C# that loops through a set of XML files stored on disk and created by a third-party program. Almost all of the files are loaded and processed successfully by the LINQ code that follows this statement:
XDocument xmlDoc = XDocument.Load(inFileName);
List<DocMetaData> docList =
    (from d in xmlDoc.Descendants("DOCUMENT")
     select new DocMetaData
     {
         File = d.Element("FILE").SafeGetAttributeValue("filename"),
         Folder = d.Element("FOLDER").SafeGetAttributeValue("name"),
         ItemID = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Item ID(idmId)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault(),
         Comment = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Comment(idmComment)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault(),
         Title = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Title(idmName)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault(),
         DocClass = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Document Class(idmDocType)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault()
     }
    ).ToList<DocMetaData>();
...where inFileName is a full path and filename such as:
Y:\S2Out\B0000004\Pet Tab\convert.B0000004.Pet Tab.xml
But a few of the files cause problems like this:
System.Xml.XmlException: Invalid character in the given encoding. Line 52327, position 126.
at System.Xml.XmlTextReaderImpl.Throw(Exception e)
at System.Xml.XmlTextReaderImpl.Throw(String res, String arg)
at System.Xml.XmlTextReaderImpl.InvalidCharRecovery(Int32& bytesCount, Int32& charsCount)
at System.Xml.XmlTextReaderImpl.GetChars(Int32 maxCharsCount)
at System.Xml.XmlTextReaderImpl.ReadData()
at System.Xml.XmlTextReaderImpl.ParseAttributeValueSlow(Int32 curPos, Char quoteChar, NodeData attr)
at System.Xml.XmlTextReaderImpl.ParseAttributes()
at System.Xml.XmlTextReaderImpl.ParseElement()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.XmlTextReaderImpl.Read()
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
at System.Xml.Linq.XDocument.Load(String uri, LoadOptions options)
at System.Xml.Linq.XDocument.Load(String uri)
at CBMI.WinFormsUI.GridForm.processFile(StreamWriter oWriter, String inFileName, Int32 XMLfileNumber) in C:\ProjectsVS2010\CBMI.LatitudePostConverter\CBMI.LatitudePostConverter\CBMI.WinFormsUI\GridForm.cs:line 147
at CBMI.WinFormsUI.GridForm.btnProcess_Click(Object sender, EventArgs e) in C:\ProjectsVS2010\CBMI.LatitudePostConverter\CBMI.LatitudePostConverter\CBMI.WinFormsUI\GridForm.cs:line 105
The XML files look like this (this sample shows only 2 DOCUMENT elements but there are many):
<?xml version="1.0" ?>
<DOCUMENTCOLLECTION>
<DOCUMENT>
<FILE filename="e:\S2Out\B0000005\General\D003712420.0001.pdf" outputpath="e:\S2Out\B0000005\General"/>
<ANNOTATION filename=""/>
<INDEX name="Comment(idmComment)" value=""/>
<INDEX name="Document Class(idmDocType)" value="General"/>
<INDEX name="Item ID(idmId)" value="003712420"/>
<INDEX name="Original File Name(idmDocOriginalFile)" value="Matrix Aligning 603.24 Criteria to Petition Pages.pdf"/>
<INDEX name="Title(idmName)" value="Matrix for 603.24"/>
<FOLDER name="/Accreditation/PASBVE/2004-06"/>
</DOCUMENT>
<DOCUMENT>
<FILE filename="e:\S2Out\B0000005\General\D003712442.0001.pdf" outputpath="e:\S2Out\B0000005\General"/>
<ANNOTATION filename=""/>
<INDEX name="Comment(idmComment)" value=""/>
<INDEX name="Document Class(idmDocType)" value="General"/>
<INDEX name="Item ID(idmId)" value="003712442"/>
<INDEX name="Original File Name(idmDocOriginalFile)" value="Contacts at NDU.pdf"/>
<INDEX name="Title(idmName)" value="Contacts at NDU"/>
<FOLDER name="/Accreditation/NDU/2006-12/Self-Study"/>
</DOCUMENT>
The LINQ statements have their own complexities, but I think they work OK; it is the Load call that fails. I have looked at the various overloads of XDocument.Load and researched other questions about this exception, but I am still confused about how to prevent it.
Lastly, the exception reports line 52327, position 126, but the data on line 52327 of the file that failed to load does not look like it should cause a problem (and its last character is at position 103!):
<FILE filename="e:\S2Out\B0000004\Pet Tab\D003710954.0001.pdf" outputpath="e:\S2Out\B0000004\Pet Tab"/>
To control the encoding (once you know what it is), you can load the files using the XDocument.Load overload that accepts a TextReader. Create a new StreamReader against your file, specifying the appropriate Encoding in the constructor, and pass that reader to Load.
For example, to open the file using Western European encoding, replace the following line of code in the question:
XDocument xmlDoc = XDocument.Load(inFileName);
with this code:
XDocument xmlDoc = null;
using (StreamReader oReader = new StreamReader(inFileName, Encoding.GetEncoding("ISO-8859-1")))
{
    xmlDoc = XDocument.Load(oReader);
}
The list of supported encodings can be found in the MSDN documentation.
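If you do not know the encoding up front, one possible approach is to try strict UTF-8 first (what an XML parser assumes when a file has no encoding declaration or byte order mark) and fall back to another encoding only when decoding fails. The following is a minimal sketch under that assumption, not a definitive implementation: `EncodingFallbackLoader`, `ParseBytes`, and `LoadFile` are hypothetical names, and ISO-8859-1 as the fallback is simply carried over from the example above.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class EncodingFallbackLoader
{
    // Hypothetical helper: decode as strict UTF-8, otherwise as ISO-8859-1
    // (the fallback encoding is an assumption; substitute the real one if known).
    public static XDocument ParseBytes(byte[] bytes)
    {
        // throwOnInvalidBytes: true makes invalid sequences raise
        // DecoderFallbackException instead of being silently replaced.
        Encoding strictUtf8 = new UTF8Encoding(false, true);
        string text;
        try
        {
            text = strictUtf8.GetString(bytes);
        }
        catch (DecoderFallbackException)
        {
            text = Encoding.GetEncoding("ISO-8859-1").GetString(bytes);
        }

        // GetString keeps a leading byte order mark as U+FEFF; drop it before parsing.
        return XDocument.Parse(text.TrimStart('\uFEFF'));
    }

    // Convenience wrapper for a file on disk.
    public static XDocument LoadFile(string path)
    {
        return ParseBytes(File.ReadAllBytes(path));
    }
}
```

In the question's code this would stand in for the `XDocument.Load(inFileName)` call, for example `XDocument xmlDoc = EncodingFallbackLoader.LoadFile(inFileName);`. Because ISO-8859-1 maps every byte to a character, the fallback never fails, so it can also quietly produce wrong characters if the file is really in some other encoding.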
The referenced file contains a character that is valid for a filename, but invalid in an XML attribute. You have a few options.
Not sure if this is your case, but this can be related to invalid byte sequences for a given encoding. Example: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences.
Try filtering invalid sequences from the file while loading.
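As a rough illustration of that idea, the sketch below decodes the file leniently so that invalid UTF-8 sequences become the replacement character U+FFFD, reports which lines were affected, and then parses the cleaned text. It assumes the file is meant to be UTF-8; `InvalidByteFilter`, `FindSuspectLines`, and `LoadLenient` are hypothetical names, not part of any library.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Xml.Linq;

public static class InvalidByteFilter
{
    // Encoding.UTF8 does not throw on invalid bytes; it substitutes U+FFFD for them.
    private const char Replacement = '\uFFFD';

    // Returns the 1-based numbers of lines that contain at least one replaced sequence.
    public static List<int> FindSuspectLines(byte[] bytes)
    {
        var lines = new List<int>();
        string[] parts = Encoding.UTF8.GetString(bytes).Split('\n');
        for (int i = 0; i < parts.Length; i++)
        {
            if (parts[i].IndexOf(Replacement) >= 0)
            {
                lines.Add(i + 1);
            }
        }
        return lines;
    }

    // Parses the document with every invalid sequence replaced by U+FFFD.
    public static XDocument LoadLenient(byte[] bytes)
    {
        // GetString keeps a leading byte order mark as U+FFEF's counterpart U+FEFF; drop it.
        string text = Encoding.UTF8.GetString(bytes).TrimStart('\uFEFF');
        return XDocument.Parse(text);
    }
}
```

Calling `InvalidByteFilter.FindSuspectLines(File.ReadAllBytes(inFileName))` gives a list of lines to inspect by hand, which may be more reliable than the line and position in the exception message. Note that a file that legitimately contains U+FFFD would be flagged too, and that the replaced characters are lost, so this suits cases where loading something matters more than preserving the exact text.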
Because XDocument (like XmlDocument) loads the entire file in one go, the whole load aborts as soon as it runs into a character that is invalid for the encoding. If you want to process what you can and skip or log the bad bits, look at XmlTextReader. An XmlTextReader created over a FileStream reads one node at a time, so it will also use a lot less memory. You could even split the work up and parallelise the processing.
When I've had this, it's been things like accented characters: graves, acutes, umlauts, and such.
I don't have an automated process for this, so usually I just load the file in Visual Studio and edit out the bad characters until there are no squigglies left. The theory is sound though.
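A minimal sketch of the node-at-a-time approach, using `XmlReader.Create` (the factory method that returns the same kind of forward-only reader as XmlTextReader). `StreamingDocumentReader` and `ReadDocuments` are hypothetical names, and the element name `DOCUMENT` is taken from the sample in the question. One caveat worth stating: once the reader throws an XmlException it cannot resume, so this keeps the elements parsed before the bad byte and reports the error rather than skipping past it.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class StreamingDocumentReader
{
    // Reads DOCUMENT elements one at a time. On a parse or encoding error,
    // returns the elements read so far and hands back the exception for logging.
    public static List<XElement> ReadDocuments(Stream input, out XmlException error)
    {
        var documents = new List<XElement>();
        error = null;
        try
        {
            using (XmlReader reader = XmlReader.Create(input))
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "DOCUMENT")
                    {
                        // ReadFrom consumes the whole element and leaves the
                        // reader positioned on the node that follows it.
                        documents.Add((XElement)XNode.ReadFrom(reader));
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
        }
        catch (XmlException ex)
        {
            // ex.LineNumber and ex.LinePosition can be logged by the caller.
            error = ex;
        }
        return documents;
    }
}
```

A caller could open the file with `File.OpenRead(inFileName)`, pass the stream in, run the existing LINQ projection over the returned elements, and log `error` when it is not null.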