Why can't I parse my scraped HTML into XML?

Question

I am trying to parse some scraped HTML into valid XML using this function.

My test code (with the htmlParse function copied and pasted from Ben Nadel's blog):

<cfscript>
    // I take an HTML string and parse it into an XML(XHTML)
    // document. This is returned as a standard ColdFusion XML
    // document.
    function htmlParse( htmlContent, disableNamespaces = true ){

        // Create an instance of the Xalan SAX2DOM class as the
        // recipient of the TagSoup SAX (Simple API for XML) compliant
        // events. TagSoup will parse the HTML and announce events as
        // it encounters various HTML nodes. The SAX2DOM instance will
        // listen for such events and construct a DOM tree in response.
        var saxDomBuilder = createObject( "java", "com.sun.org.apache.xalan.internal.xsltc.trax.SAX2DOM" ).init();

        // Create our TagSoup parser.
        var tagSoupParser = createObject( "java", "org.ccil.cowan.tagsoup.Parser" ).init();

        // Check to see if namespaces are going to be disabled in the
        // parser. If so, then they will not be added to elements.
        if (disableNamespaces){

            // Turn off namespaces - they are lame and nobody likes
            // to perform xmlSearch() methods with them in place.
            tagSoupParser.setFeature(
                tagSoupParser.namespacesFeature,
                javaCast( "boolean", false )
            );

        }

        // Set our DOM builder to be the listener for SAX-based
        // parsing events on our HTML.
        tagSoupParser.setContentHandler( saxDomBuilder );

        // Create our content input. The InputSource encapsulates the
        // means by which the content is read.
        var inputSource = createObject( "java", "org.xml.sax.InputSource" ).init(
            createObject( "java", "java.io.StringReader" ).init( htmlContent )
        );

        // Parse the HTML. This will trigger events which the SAX2DOM
        // builder will translate into a DOM tree.
        tagSoupParser.parse( inputSource );

        // Now that the HTML has been parsed, we have to get a
        // representation that is similar to the XML document that
        // ColdFusion users are used to having. Let's search for the
        // ROOT document and return it.
        return(
            xmlSearch( saxDomBuilder.getDom(), "/node()" )[ 1 ]
        );

    }
</cfscript>
<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset parsedData = htmlParse(html) />

(The html is received in this format from a different function, but I tried hardcoding the string for now to trace the problem.)

I get the following error:

NOT_FOUND_ERR: An attempt is made to reference a node in a context where it does not exist. 
The error occurred in myfilePath/myfileName.cfm: line 42

40 :        // Parse the HTML. This will trigger events which the SAX2DOM
41 :        // builder will translate into a DOM tree.
42 :        tagSoupParser.parse( inputSource );

What is going wrong? How can I correct it?

Brian Ghidinelli · Accepted Answer

I haven't used TagSoup, but I have used jTidy for years, with great results, to take user-provided HTML from all kinds of sources (including MS Word) and clean it up so that it comes back as XHTML.

You can try jTidy on the same document by dropping the jTidy jar onto your classpath or using JavaLoader to load it. Since you're on CF10, you can use this method to include the JAR.
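The method linked above is not reproduced here. One option available in CF10, offered as an assumption rather than as what the link pointed to, is the this.javaSettings struct in Application.cfc. The application name and the ./lib folder below are hypothetical; adjust them to wherever the jTidy JAR actually lives.

```cfml
// Application.cfc (sketch; application name and path are placeholders)
component {

    this.name = "htmlCleanupDemo";

    // CF10+: load extra JARs per application instead of editing
    // the server-wide classpath.
    this.javaSettings = {
        loadPaths = [ "./lib" ],          // hypothetical folder containing the jTidy JAR
        loadColdFusionClassPath = true,   // keep ColdFusion's own classes visible
        reloadOnChange = false
    };

}
```

With a setup like this, createObject( "java", "org.w3c.tidy.Tidy" ) should resolve without JavaLoader, provided the JAR is in that folder.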

Then, here's how to call jTidy in cfscript:

jTidy = createObject("java", "org.w3c.tidy.Tidy");

jTidy.setQuiet(false);
jTidy.setIndentContent(true);
jTidy.setSmartIndent(true);
jTidy.setIndentAttributes(true);
jTidy.setWraplen(1024);
jTidy.setXHTML(true);
jTidy.setNumEntities(true);
jTidy.setConvertWindowsChars(true);             
jTidy.setFixBackslash(true);        // changes \ in urls to /
jTidy.setLogicalEmphasis(true);     // uses strong/em instead of b/i
jTidy.setDropEmptyParas(true);

// create the in and out streams for jTidy
// (parseData is the HTML string to clean up)
readBuffer = CreateObject("java","java.lang.String").init(parseData).getBytes();
inP = createobject("java","java.io.ByteArrayInputStream").init(readBuffer);
outx = createObject("java", "java.io.ByteArrayOutputStream").init();

// do the parsing
jTidy.parse(inP,outx);
outstr = outx.toString();

This will return valid XHTML which you can query against with XPath. I wrapped the above into a makeValid() function and then ran it against your HTML:

<cfset html='<tr > <td align="center"> <span id="id1" >Compliance Review</span> </td><td class="center"> <span id="id2" >395.8(i)</span> </td><td align="left"> <span id="id3" >Failing to submit a record of duty status within 13 days </span> </td><td class="center" > <span id="id4">4/17/2014</span> </td> </tr>' />
<cfset out = makeValid(html) />
<cfdump var="#xmlParse(out)#" />

And here was the output:

Picture of the cfdump output from xmlParse()
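The makeValid() wrapper is mentioned but not shown. A minimal sketch of what it might look like follows, assuming it simply wraps the jTidy calls from the answer; the argument name htmlContent and the local variable names are hypothetical, and only a subset of the setters is repeated.

```cfml
<cfscript>
    // Hypothetical wrapper: the answer names makeValid() but does not show
    // its body. This version reuses only calls that appear in the answer.
    function makeValid( required string htmlContent ){

        var tidy = createObject( "java", "org.w3c.tidy.Tidy" ).init();
        tidy.setXHTML( true );        // emit XHTML rather than HTML
        tidy.setNumEntities( true );  // numeric entities, so no DTD lookup is needed
        tidy.setWraplen( 1024 );

        // Create the in and out streams for jTidy.
        var readBuffer = createObject( "java", "java.lang.String" ).init( htmlContent ).getBytes();
        var inStream = createObject( "java", "java.io.ByteArrayInputStream" ).init( readBuffer );
        var outStream = createObject( "java", "java.io.ByteArrayOutputStream" ).init();

        // Parse the input and return the cleaned-up markup as a string.
        tidy.parse( inStream, outStream );
        return outStream.toString();
    }

    // Usage sketch: parse the cleaned markup and look for the <span> nodes.
    // XHTML output normally declares a default namespace, so the XPath
    // matches on local-name() instead of a bare element name.
    doc = xmlParse( makeValid( html ) );
    spans = xmlSearch( doc, "//*[local-name()='span']" );
    for ( span in spans ){
        writeOutput( span.xmlAttributes.id & ": " & trim( span.xmlText ) & "<br>" );
    }
</cfscript>
```

What the spans array contains depends on how jTidy repairs the bare <tr> fragment, so treat the loop as a starting point for inspecting the result rather than a guaranteed extraction.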

Tags: java, coldfusion, html-parsing, web-scraping

froadie
