
How to speed up loading DTD through DOCTYPE

Tags: c#, .net, xml
I need to load a number of xhtml files that have this at the top:

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

Each file will be loaded into a separate System.Xml.XmlDocument. Because of the DOCTYPE declaration, they take a very long time to load. I tried setting XmlResolver = null, but then an XmlException is thrown because the documents use entities that are undeclared without the DTD (e.g., &rdquo;). So I thought I could download the DTD once, for the first XmlDocument, and somehow reuse it for the subsequent XmlDocuments (and thus avoid the performance hit), but I have no idea how to do this.

I'm using .NET 3.5.

Thanks.

asked Sep 17 '10 06:09 by Polyfun


2 Answers

I think you should be able to solve this using XmlPreloadedResolver. However, I'm having some difficulty getting it working myself. It looks like XHTML 1.0 would be easier to support since it is a "known" DTD (see XmlKnownDtds), while XHTML 1.1 isn't currently "known", which means you'll have to preload a bunch of URIs yourself.

For example:

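// Requires: using System; using System.IO; using System.Xml; using System.Xml.Resolvers;
// Start from the built-in XHTML 1.0 DTDs, then map each XHTML 1.1 URI to a local copy.
XmlPreloadedResolver xmlPreloadedResolver = new XmlPreloadedResolver(XmlKnownDtds.Xhtml10);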
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"), File.ReadAllBytes("D:\\xhtml11.dtd"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-inlstyle-1.mod"), File.ReadAllBytes("D:\\xhtml-inlstyle-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod"), File.ReadAllBytes("D:\\xhtml-framework-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-text-1.mod"), File.ReadAllBytes("D:\\xhtml-text-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-hypertext-1.mod"), File.ReadAllBytes("D:\\xhtml-hypertext-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-list-1.mod"), File.ReadAllBytes("D:\\xhtml-list-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-edit-1.mod"), File.ReadAllBytes("D:\\xhtml-edit-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-bdo-1.mod"), File.ReadAllBytes("D:\\xhtml-bdo-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/ruby/xhtml-ruby-1.mod"), File.ReadAllBytes("D:\\xhtml-ruby-1.mod"));
xmlPreloadedResolver.Add(new Uri("http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-pres-1.mod"), File.ReadAllBytes("D:\\xhtml-pres-1.mod"));
// TODO: Add other modules here (see the xhtml11.dtd for the full list)
XmlDocument xmlDocument = new XmlDocument();
xmlDocument.XmlResolver = xmlPreloadedResolver;
xmlDocument.Load("D:\\doc1.xml");
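Since the question is about loading many files, the same resolver instance can be handed to every XmlDocument so that the DTD bytes are read from disk only once. A minimal sketch, assuming the resolver has already been populated as above; the XhtmlBatchLoader class and LoadAll method are hypothetical names, not part of the framework:

```csharp
using System.Collections.Generic;
using System.Xml;

public static class XhtmlBatchLoader
{
    // Loads every file with the same resolver, so DTD content is served
    // from the resolver's in-memory store instead of being fetched per document.
    public static List<XmlDocument> LoadAll(IEnumerable<string> paths, XmlResolver sharedResolver)
    {
        var documents = new List<XmlDocument>();
        foreach (string path in paths)
        {
            var document = new XmlDocument();
            document.XmlResolver = sharedResolver; // must be set before Load
            document.Load(path);
            documents.Add(document);
        }
        return documents;
    }
}
```

Note that the DTD is still parsed for each document; only the repeated retrieval is avoided.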
answered Nov 11 '22 03:11 by Daniel Renshaw


For .NET Framework 3.5 and below, it might have been possible to use the XmlUrlResolver, as shown in this answer. However, this approach downloads the DTDs from the W3C website at runtime, which is not a good idea, not least because W3C seems to be currently blocking such requests. The other answer suggests caching the DTDs as embedded resources in the assembly, similar to your HTML2XHTML.
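For .NET Framework 3.5, where XmlPreloadedResolver is not available, one way to avoid the runtime download is a small XmlUrlResolver subclass that serves locally stored copies. This is only a sketch under stated assumptions: the LocalDtdResolver class is hypothetical, and it assumes the DTD, its modules and entity files have been saved into one directory under their original file names.

```csharp
using System;
using System.IO;
using System.Xml;

public sealed class LocalDtdResolver : XmlUrlResolver
{
    private readonly string directory;

    // directory: folder holding local copies (assumed layout: one flat folder).
    public LocalDtdResolver(string directory)
    {
        this.directory = directory;
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        // Map e.g. http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd to <directory>\xhtml11.dtd.
        string fileName = Path.GetFileName(absoluteUri.AbsolutePath);
        string localPath = Path.Combine(directory, fileName);
        if (File.Exists(localPath))
            return File.OpenRead(localPath);

        // Anything without a local copy falls back to the default behaviour
        // (which may still go to the network).
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}
```

It would be assigned like any other resolver, for example xmlDocument.XmlResolver = new LocalDtdResolver("D:\\dtds"); before calling Load. Matching on the file name alone is a simplification and would need refining if two referenced URIs shared a file name.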

For other readers using .NET Framework 4.0 and above, you could use XmlPreloadedResolver, as suggested by Daniel Renshaw, which supports XHTML 1.0. To support XHTML 1.1, you could simplify your implementation by using the flattened version of the DTD, available at xhtml11-flat.dtd on the W3C website. I define an extension method for this purpose:

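using System;
using System.IO;
using System.Linq; // needed for PreloadedUris.Contains
using System.Xml.Resolvers;

public static class XmlPreloadedResolverExtensions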
{
    private const string Xhtml11DtdPublicId = "-//W3C//DTD XHTML 1.1//EN";
    private const string Xhtml11DtdSystemId = "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd";

    public static void AddXhtml11(this XmlPreloadedResolver resolver, bool @override = false)
    {
        Add(resolver, new Uri(Xhtml11DtdPublicId, UriKind.RelativeOrAbsolute), ManifestResources.xhtml11_flat_dtd, @override);
        Add(resolver, new Uri(Xhtml11DtdSystemId, UriKind.RelativeOrAbsolute), ManifestResources.xhtml11_flat_dtd, @override);
    }

    public static bool Add(this XmlPreloadedResolver resolver, Uri uri, Stream value, bool @override)
    {
        if (@override || !resolver.PreloadedUris.Contains(uri))
        {
            resolver.Add(uri, value);
            return true;
        }

        return false;
    }
}

This could then be used like ordinary XmlResolver instances:

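// Requires: using System.Xml; using System.Xml.Linq; using System.Xml.Resolvers;
// `input` is whatever XmlReader.Create accepts: a URI string, a Stream, or a TextReader.
var xmlResolver = new XmlPreloadedResolver();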
xmlResolver.AddXhtml11();

XmlReaderSettings settings = new XmlReaderSettings();
settings.DtdProcessing = DtdProcessing.Parse;
settings.XmlResolver = xmlResolver;

XDocument document;
using (var xmlReader = XmlReader.Create(input, settings))
    document = XDocument.Load(xmlReader);
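The extension method relies on a ManifestResources.xhtml11_flat_dtd member that is not shown. One possible shape for it, assuming xhtml11-flat.dtd has been added to the project as an embedded resource, is sketched below; the resource name "MyApp.Resources.xhtml11-flat.dtd" and the Open helper are hypothetical and depend on the project's default namespace and folder layout.

```csharp
using System.IO;
using System.Reflection;

public static class ManifestResources
{
    // Hypothetical resource name; adjust to the actual namespace and folder.
    private const string Xhtml11FlatDtdName = "MyApp.Resources.xhtml11-flat.dtd";

    // Returns a fresh stream on each access, so each Add call in
    // AddXhtml11 reads the DTD from the beginning.
    public static Stream xhtml11_flat_dtd
    {
        get { return Open(Xhtml11FlatDtdName); }
    }

    public static Stream Open(string resourceName)
    {
        Assembly assembly = typeof(ManifestResources).Assembly;
        Stream stream = assembly.GetManifestResourceStream(resourceName);
        if (stream == null)
            throw new FileNotFoundException("Embedded resource not found: " + resourceName);
        return stream;
    }
}
```

Returning a new stream per access matters here because AddXhtml11 registers the same DTD under two URIs; a single shared stream would already be at its end when the second Add reads it.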
answered Nov 11 '22 01:11 by Douglas