Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

C# - Is it possible (and how) to perform XSL transformations using SgmlReader

I needed to transform the contents of an HTML web page using XSLT . Hence I used SgmlReader and wrote the snippet shown below (I thought, in the end, it's an XmlReader too ...)

XmlReader xslr = XmlReader.Create(new StringReader(
    "<?xml version=\"1.0\" encoding=\"UTF-8\"?>" +
    "<xsl:stylesheet xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\" version=\"1.0\">" +
    "<xsl:output method=\"xml\" encoding=\"UTF-8\" version=\"1.0\" />" +
    "<xsl:template match=\"/\">" +
    "<XXX xmlns:xsd=\"http://www.w3.org/2001/XMLSchema\"><xsl:value-of select=\"count(//br)\" /></XXX>" +
    "</xsl:template>" +
    "</xsl:stylesheet>"));

XslCompiledTransform xslt = new XslCompiledTransform();
xslt.Load(xslr);

using (SgmlReader html = new SgmlReader())
{
    StringBuilder sb = new StringBuilder();
    using (TextWriter sw = new StringWriter(sb))
    using (XmlWriter xw = new XmlTextWriter(sw))
    {
        html.InputStream = new StringReader(Resources.html_orig);
        html.DocType = "HTML";

        try
        {
            xslt.Transform(html, xw);
            string output = sb.ToString();
            System.Console.WriteLine(output);
        }
        catch (Exception exc)
        {
            System.Console.WriteLine("{0} : {1}", exc.GetType().Name, exc.Message);
            System.Console.WriteLine(exc.StackTrace);
        }
    }
}

Nonetheless , I get thos error message

NullReferenceException : Object reference not set to an instance of an object.
   at MS.Internal.Xml.Cache.XPathDocumentBuilder.Initialize(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at MS.Internal.Xml.Cache.XPathDocumentBuilder..ctor(XPathDocument doc, IXmlLineInfo lineInfo, String baseUri, LoadFlags flags)
   at System.Xml.XPath.XPathDocument.LoadFromReader(XmlReader reader, XmlSpace space)
   at System.Xml.XPath.XPathDocument..ctor(XmlReader reader, XmlSpace space)
   at System.Xml.Xsl.Runtime.XmlQueryContext.ConstructDocument(Object dataSource, String uriRelative, Uri uriResolved)
   at System.Xml.Xsl.Runtime.XmlQueryContext..ctor(XmlQueryRuntime runtime, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, WhitespaceRuleLookup wsRules)
   at System.Xml.Xsl.Runtime.XmlQueryRuntime..ctor(XmlQueryStaticData data, Object defaultDataSource, XmlResolver dataSources, XsltArgumentList argList, XmlSequenceWriter seqWrt)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlSequenceWriter results)
   at System.Xml.Xsl.XmlILCommand.Execute(Object defaultDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter writer, Boolean closeWriter)
   at System.Xml.Xsl.XmlILCommand.Execute(XmlReader contextDocument, XmlResolver dataSources, XsltArgumentList argumentList, XmlWriter results)
   at System.Xml.Xsl.XslCompiledTransform.Transform(XmlReader input, XmlWriter results)

I found a way to work around this by converting the HTML to XML and then applying the transform , but that's an inefficient solution because :

  1. Intermediate XHTML output goes to a buffer , so extra memory is needed
  2. Conversion process needs extra CPU processing and the same hierarchy is traversed twice (in theory unnecessarily).

So (since I know StackOverflow community always provides great answers whereas other C# forums have completely disappointed me ;o) I'll be looking for feedback and suggestions so as to perform XSL transformations using HTML directly (even if SgmlReader needs to be replaced by another similar library).

like image 901
Olemis Lang Avatar asked Nov 30 '10 15:11

Olemis Lang


2 Answers

Even if the SgmlReader class is extending the XmlReader class it doesn't mean that it also behaves like an XmlReader.

Technically it also does not make sense that SgmlReader is a subclass of XmlReader, simply because SGML is a superset of XML and not a subset.

You didn't write about the purpose of your transformation, but in general HTML Agility Pack is a good option for manipulating HTML.

like image 151
Dirk Vollmar Avatar answered Sep 30 '22 22:09

Dirk Vollmar


Have you tried using the HTML Agility Pack instead of SgmlReader? You can load the html into it, and run a transform against it directly. I'm not positive if an XML document is created internally, though - although it seems as though one is not you would probably want to compare memory and CPU usage against the conversion method you tried and discarded.

//You already have your xslt loaded into var xslt...

HtmlDocument doc = new HtmlDocument();
doc.Load( ... );  //load your HTML doc, or use LoadXML from a string, etc  
xslt.Transform(doc, xw);

See also this question: How to use HTML Agility pack

like image 25
Philip Rieck Avatar answered Sep 30 '22 20:09

Philip Rieck