
XDocument: is it possible to force loading of a malformed XML file?

Tags:

c#

linq-to-xml

I have a malformed XML file. The root element is never closed: its final closing tag is missing.

When I try to load the malformed XML file in C#:

StreamReader sr = new StreamReader(path);
batchFile = XDocument.Load(sr); // throws XmlException

I get an exception "Unexpected end of file has occurred. The following elements are not closed: batch. Line 54, position 1."

Is it possible to ignore the missing closing tag or to force the load? I noticed that all my XML tools (like XML Notepad) automatically fix or ignore the problem. I cannot fix the XML file myself: it comes from third-party software, and sometimes the file is correct.

Bastien Vandamme asked Apr 18 '11 09:04



3 Answers

You can't do it with XDocument, because this class loads the whole document into memory and parses it completely.
It is possible, however, to process the document with XmlReader: it lets you read and process the document node by node, and you only get the missing-tag exception at the very end.
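
The XmlReader approach could be sketched roughly as follows. This is only an illustration: the class and method names (LenientXml, ReadElementNames) are made up for the example, and it merely collects element names where real code would process each node. It assumes the only defect is the missing closing tag at the end of the input.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class LenientXml
{
    // Reads element names in document order until the reader reaches the end
    // of the input or hits a well-formedness error. Everything read before
    // the error is returned; the error itself (if any) is handed back via 'error'.
    public static List<string> ReadElementNames(TextReader input, out XmlException error)
    {
        var names = new List<string>();
        error = null;

        using (XmlReader reader = XmlReader.Create(input))
        {
            try
            {
                while (reader.Read())
                {
                    if (reader.NodeType == XmlNodeType.Element)
                    {
                        names.Add(reader.Name);
                    }
                }
            }
            catch (XmlException ex)
            {
                // For a missing root closing tag this is raised only after
                // all preceding nodes have already been delivered.
                error = ex;
            }
        }

        return names;
    }
}
```

Called with a StreamReader over the file, this returns the elements that were read and leaves it to the caller to decide whether the reported XmlException can be ignored.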

Anton Semenov answered Oct 12 '22 21:10



I suggest using Tidy.NET to clean up messy input.

Tidy.NET has a nice API to get a list of problems (MessageCollection) in your 'XML', and you can use it to fix the text stream in memory. The simplest thing would be to fix one error at a time, though that will not perform too well with many errors. Otherwise, you might fix errors in reverse document order, so that the offsets of the remaining messages stay valid while you apply the fixes.
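
To illustrate why reverse document order keeps the offsets valid, here is a minimal sketch that is independent of Tidy.NET. The Insertion type and the Apply method are hypothetical names invented for this example; turning Tidy's messages into offsets and replacement text is not shown and would be up to you.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class ReverseOrderFixes
{
    // Hypothetical description of one fix: insert Text at the character
    // position Offset, measured against the ORIGINAL input.
    public sealed class Insertion
    {
        public int Offset { get; }
        public string Text { get; }

        public Insertion(int offset, string text)
        {
            Offset = offset;
            Text = text;
        }
    }

    // Applies the insertions from the highest offset down to the lowest.
    // An insertion only shifts the text behind it, so the offsets of the
    // fixes that are still pending (all earlier in the document) stay valid.
    public static string Apply(string source, IEnumerable<Insertion> fixes)
    {
        var buffer = new StringBuilder(source);
        foreach (Insertion fix in fixes.OrderByDescending(f => f.Offset))
        {
            buffer.Insert(fix.Offset, fix.Text);
        }
        return buffer.ToString();
    }
}
```

For example, applying the insertions "</a>" at offset 7 and "</b>" at offset 11 to "<r><a>1<b>2</r>" closes both inner elements without recomputing any offset.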

Here is an example to convert HTML input into XHTML:

Tidy tidy = new Tidy();

/* Set the options you want */
tidy.Options.DocType = DocType.Strict;
tidy.Options.DropFontTags = true;
tidy.Options.LogicalEmphasis = true;
tidy.Options.Xhtml = true;
tidy.Options.XmlOut = true;
tidy.Options.MakeClean = true;
tidy.Options.TidyMark = false;

/* Declare the parameters that are needed */
TidyMessageCollection tmc = new TidyMessageCollection();
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();

byte[] byteArray = Encoding.UTF8.GetBytes("Put your HTML here...");
input.Write(byteArray, 0 , byteArray.Length);
input.Position = 0;
tidy.Parse(input, output, tmc);

string result = Encoding.UTF8.GetString(output.ToArray());
sehe answered Oct 12 '22 22:10



What you could do is add the closing tag to the XML in memory and then load it.

So, after reading the file with the StreamReader, manipulate the text before you hand it to XDocument.
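
A minimal sketch of that idea, assuming the only defect is the missing closing tag of the root element. The names BatchLoader and LoadWithMissingRootClose are invented for this example, and the root element name ("batch" in the question's error message) has to be supplied by the caller.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Linq;

public static class BatchLoader
{
    // Tries to parse the text as-is. If that fails, appends the closing tag
    // for the given root element and tries once more. Any other defect still
    // surfaces as an XmlException from the second attempt.
    public static XDocument LoadWithMissingRootClose(string xml, string rootName)
    {
        try
        {
            return XDocument.Parse(xml);
        }
        catch (XmlException)
        {
            return XDocument.Parse(xml + "</" + rootName + ">");
        }
    }
}
```

With the file from the question, this could be called as BatchLoader.LoadWithMissingRootClose(File.ReadAllText(path), "batch"). Because the unmodified text is tried first, files that arrive correct are loaded unchanged.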

Ivo answered Oct 12 '22 22:10
