
What is the fastest way to programmatically check the well-formedness of XML files in C#?

I have large batches of XHTML files that are manually updated. During the review phase of the updates I would like to programmatically check the well-formedness of the files. I am currently using an XmlReader, but the time required on an average CPU is much longer than I expected.

The XHTML files range in size from 4KB to 40KB and verifying takes several seconds per file. Checking is essential but I would like to keep the time as short as possible as the check is performed while files are being read into the next process step.

Is there a faster way of doing a simple XML well-formedness check? Maybe using external XML libraries?


I can confirm that checking "regular" XML-based content is lightning fast using the XmlReader and, as suggested, the problem seems to be that the XHTML DTD is read each time a file is checked.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

Note that in addition to the DTD, corresponding .ent files (xhtml-lat1.ent, xhtml-symbol.ent, xhtml-special.ent) are also downloaded.

Ignoring the DTD completely is not really an option for XHTML, because well-formedness is closely linked to the allowed HTML entities (e.g., an &nbsp; will promptly cause errors when the DTD is ignored).


The problem was solved by using a custom XmlResolver as suggested, in combination with local (embedded) copies of both the DTD and entity files.

I will post the solution here once I have cleaned up the code.
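
For readers who want a starting point, here is a minimal sketch of the resolver approach described above. It is not the asker's code. It assumes local copies of the DTD and the three .ent files sit in a directory on disk (the asker used embedded resources instead), and the names LocalXhtmlResolver, XhtmlChecker and IsWellFormed are made up for illustration. DtdProcessing requires .NET 4 or later; on older frameworks the equivalent setting is ProhibitDtd = false.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Xml;

// Hypothetical resolver: serves the XHTML 1.0 Transitional DTD and its
// entity files from a local directory instead of fetching them from w3.org.
public sealed class LocalXhtmlResolver : XmlResolver
{
    private const string BaseUri = "http://www.w3.org/TR/xhtml1/DTD/";
    private readonly Dictionary<string, string> _localFiles =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

    // localDir is assumed to contain copies of the four files listed below.
    public LocalXhtmlResolver(string localDir)
    {
        string[] names = { "xhtml1-transitional.dtd", "xhtml-lat1.ent",
                           "xhtml-symbol.ent", "xhtml-special.ent" };
        foreach (string name in names)
            _localFiles[BaseUri + name] = Path.Combine(localDir, name);
    }

    // No network access takes place, so credentials are ignored.
    public override ICredentials Credentials { set { } }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        string path;
        if (_localFiles.TryGetValue(absoluteUri.AbsoluteUri, out path))
            return File.OpenRead(path);
        // Anything not cached locally is refused rather than downloaded.
        throw new XmlException("No local copy for external entity: " + absoluteUri);
    }
}

public static class XhtmlChecker
{
    // Reads the whole document; returns false and the parser message on the first error.
    public static bool IsWellFormed(TextReader input, XmlResolver resolver, out string error)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.DtdProcessing = DtdProcessing.Parse; // the DTD must be parsed so &nbsp; etc. are declared
        settings.XmlResolver = resolver;
        try
        {
            using (XmlReader reader = XmlReader.Create(input, settings))
            {
                while (reader.Read()) { }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = ex.Message;
            return false;
        }
    }
}
```

A single resolver instance can be reused for a whole batch, e.g. XhtmlChecker.IsWellFormed(File.OpenText(path), resolver, out error) for each file. Caching the DTD bytes in memory rather than reopening the files each time would be a natural next step.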

barry asked Feb 09 '09 08:02



2 Answers

I would expect that XmlReader with while (reader.Read()) {} would be the fastest managed approach. It certainly shouldn't take seconds to read 40KB... what is the input approach you are using?

Do you perhaps have some external (schema etc) entities to resolve? If so, you might be able to write a custom XmlResolver (set via XmlReaderSettings) that uses locally cached schemas rather than a remote fetch...

The following does ~300KB virtually instantly:

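    using System;
    using System.Diagnostics;
    using System.IO;
    using System.Xml;

    static class Program
    {
        static void Main()
        {
            using (MemoryStream ms = new MemoryStream())
            {
                // Build a test document in memory; CloseOutput = false keeps
                // the stream open after the writer is disposed.
                XmlWriterSettings settings = new XmlWriterSettings();
                settings.CloseOutput = false;
                using (XmlWriter writer = XmlWriter.Create(ms, settings))
                {
                    writer.WriteStartElement("xml");
                    for (int i = 0; i < 15000; i++)
                    {
                        writer.WriteElementString("value", i.ToString());
                    }
                    writer.WriteEndElement();
                }
                Console.WriteLine(ms.Length + " bytes");

                // Rewind and time a full pass; Read() throws XmlException
                // if the document is not well-formed.
                ms.Position = 0;
                int nodes = 0;
                Stopwatch watch = Stopwatch.StartNew();
                using (XmlReader reader = XmlReader.Create(ms))
                {
                    while (reader.Read()) { nodes++; }
                }
                watch.Stop();
                Console.WriteLine("{0} nodes in {1}ms", nodes,
                    watch.ElapsedMilliseconds);
            }
        }
    }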

Marc Gravell answered Sep 20 '22 03:09


Create an XmlReader object by passing in an XmlReaderSettings object whose ConformanceLevel is set to ConformanceLevel.Document.

This will validate well-formedness.

This MSDN article should explain the details.
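
As a rough sketch of that suggestion (the class and method names here are invented for illustration), the check comes down to reading the document to the end and treating any XmlException as "not well-formed":

```csharp
using System;
using System.IO;
using System.Xml;

public static class WellFormedness
{
    // Returns true if the XML text can be read to the end without a parse error.
    public static bool Check(string xml)
    {
        XmlReaderSettings settings = new XmlReaderSettings();
        settings.ConformanceLevel = ConformanceLevel.Document;
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { } // the reader checks well-formedness as it goes
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

ConformanceLevel.Document is also the default for XmlReaderSettings, so setting it mainly documents the intent. Note that with the default settings on .NET 4 and later a DOCTYPE is rejected outright, so XHTML files with a DTD reference need the DTD handling discussed in the question.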

Cerebrus answered Sep 20 '22 03:09