How can I build an HTML org.w3c.dom.Document?

Tags: java, html, dom, xml

The documentation of the Document interface describes the interface as:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I am unable to find a way to build a Document that is an HTML Document!

I want an HTML Document because I am trying to build a document that I then pass to a library that is expecting an HTML Document. This library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML, but not for XML.

I've looked around, and am not finding anything. Items like How to convert an Html source of a webpage into org.w3c.dom.Document in java? don't actually have an answer.


asked Mar 13 '15 21:03

Dmitry Minkovsky

1 Answer

You seem to have two explicit requirements:

1. You need to represent HTML as an org.w3c.dom.Document.
2. You need Document#getElementsByTagName(String tagname) to operate in a case-insensitive manner.
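
To make the second requirement concrete, here is a minimal sketch that uses only the JDK's built-in DOM implementation. The class name CaseSensitivityDemo and its helper method are made up for this illustration. It builds a tiny document with uppercase element names and then looks the body element up twice, once with matching case and once without.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class, not part of any library.
class CaseSensitivityDemo {

    // Builds <HTML><BODY/></HTML> with the default XML DOM implementation.
    static Document buildUppercaseDoc() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("HTML");
        doc.appendChild(html);
        html.appendChild(doc.createElement("BODY"));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildUppercaseDoc();
        // Lookup with the exact case used at creation time.
        System.out.println(doc.getElementsByTagName("BODY").getLength());
        // Lookup with a different case: an XML Document compares names exactly.
        System.out.println(doc.getElementsByTagName("body").getLength());
    }
}
```

The first lookup finds the element and the second finds nothing, which is exactly the behavior the library in the question cannot cope with.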

If you are trying to work with HTML using org.w3c.dom.Document, then I assume you are working with some flavor of XHTML, because an XML API such as DOM is going to expect well-formed XML. HTML isn't necessarily well-formed XML, but XHTML is. Even if you were working with plain HTML, you would have to pre-process it to make it well-formed XML before running it through an XML parser. It might be easier to parse the HTML first with an HTML parser, such as jsoup, and then build your org.w3c.dom.Document by walking the tree the HTML parser produces (org.jsoup.nodes.Document in the case of jsoup).
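
The well-formedness point can be checked with a short sketch that uses only the JDK's XML parser. The class WellFormednessCheck and the method parsesAsXml are names I made up for the example. It feeds the parser the same paragraph twice: once with an HTML-style unclosed br tag, and once with the XHTML-style self-closing form.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
class WellFormednessCheck {

    // Returns true if the markup parses as well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // SAXParseException (a SAXException) signals a well-formedness error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Valid HTML, but the unclosed <br> is not well-formed XML.
        System.out.println(parsesAsXml("<p>one<br>two</p>"));
        // The XHTML spelling of the same content.
        System.out.println(parsesAsXml("<p>one<br/>two</p>"));
    }
}
```

The parser rejects the first string and accepts the second. Depending on the parser, the rejection may also be logged to standard error.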

There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-J (2.11.0) in the form of org.apache.html.dom.HTMLDocumentImpl. At first this seems promising; however, upon closer examination, there are some issues.

1. There is not a clear, "clean" way to obtain an instance of an object implementing the org.w3c.dom.html.HTMLDocument interface.

With Xerces we normally would obtain a Document object using a DocumentBuilder in the following fashion:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using a DOMImplementation variety:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases, we are purely using org.w3c.dom.* interfaces to obtain the Document object.
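
For anyone who wants to run the two standard approaches side by side, here is a self-contained sketch that parses an in-memory string instead of a file. The class name, method names, and sample markup are my own choices for the example. I am assuming the JDK's bundled DOM implementation, which supports the "LS" feature.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSParser;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class StandardDomParsing {

    static final String XML = "<html><head><title>My Title</title></head><body/></html>";

    // Approach 1: JAXP DocumentBuilder.
    static Document viaDocumentBuilder(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Approach 2: DOM Level 3 Load and Save.
    static Document viaLoadAndSave(String xml) throws Exception {
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS impl = (DOMImplementationLS) registry.getDOMImplementation("LS");
        LSParser parser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
        LSInput input = impl.createLSInput();
        input.setStringData(xml); // parse from a string rather than a URI
        return parser.parse(input);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(viaDocumentBuilder(XML).getDocumentElement().getTagName());
        System.out.println(viaLoadAndSave(XML).getDocumentElement().getTagName());
    }
}
```

Neither method names a concrete implementation class, which is the property the HTMLDocument route below lacks.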

The closest equivalent I found for HTMLDocument was something like this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

This requires us to reference internal implementation classes directly, which makes us dependent on the Xerces implementation.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)

2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

Although you can search the HTMLDocument tree using getElementsByTagName(String tagname) in a case-insensitive manner, all of the element names are saved internally in ALL CAPS. But XHTML element and attribute names are supposed to be in all lowercase. (This could be worked around by walking the entire document tree and using Document's renameNode() method to change all of the elements' names to lowercase.)
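
The renameNode() workaround could look roughly like the sketch below. I wrote it against a plain Document from the JDK's parser, so whether Xerces' HTMLDocumentImpl accepts the same calls is an assumption I have not verified. The class and method names are made up for the example.

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class, not part of any library.
class LowercaseRenamer {

    // Recursively renames every element at or below 'node' to its lowercase name.
    static void lowercaseElementNames(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String lower = node.getNodeName().toLowerCase(Locale.ROOT);
            if (!lower.equals(node.getNodeName())) {
                // renameNode may return a replacement node, so continue with its result.
                node = doc.renameNode(node, node.getNamespaceURI(), lower);
            }
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before the child changes
            lowercaseElementNames(doc, child);
            child = next;
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("HTML");
        doc.appendChild(html);
        html.appendChild(doc.createElement("BODY"));
        lowercaseElementNames(doc, doc);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Locale.ROOT keeps the lowercasing independent of the default locale. Attribute names are left alone here; they would need a similar pass.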

Additionally, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set those in an HTMLDocument (unless you do some fiddling with internal Xerces implementations).
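
For comparison, on a plain org.w3c.dom.Document (not an HTMLDocument) the standard DOMImplementation interface can supply both the DOCTYPE and the namespaced root element. The following is a minimal sketch of that. The class name XhtmlSkeleton and its method are hypothetical, and it says nothing about how to do the same on HTMLDocument.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

// Hypothetical demo class, not part of any library.
class XhtmlSkeleton {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Creates a Document with an XHTML 1.0 Strict DOCTYPE and an <html> root in the XHTML namespace.
    static Document newXhtmlDocument() throws Exception {
        DOMImplementation impl = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().getDOMImplementation();
        DocumentType docType = impl.createDocumentType("html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        return impl.createDocument(XHTML_NS, "html", docType);
    }

    public static void main(String[] args) throws Exception {
        Document doc = newXhtmlDocument();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```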

3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface seems incomplete.

I didn't scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs, and comments in the source code of the Xerces internal implementation. In those comments, I also found notes that several different parts of the interface weren't implemented. (Sidenote: I really got the impression that the org.w3c.dom.html.HTMLDocument interface itself isn't really used by anyone and perhaps is incomplete itself.)

For those reasons, I think it's better to avoid org.w3c.dom.html.HTMLDocument and just do what we can with org.w3c.dom.Document. What can we do?

Well, one approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach doesn't require much code, but it still makes us dependent on the Xerces implementation, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we just convert all tag names to lowercase on element creation and in searches. This allows use of Document#getElementsByTagName(String tagname) in a case-insensitive manner.

MyHTMLDocumentImpl:

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

    /**
     * This method allows us to create our
     * MyHTMLDocumentImpl from an existing Document.
     * Note that the copied elements keep their original names;
     * they are not lowercased here.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    // Locale.ROOT keeps the lowercasing independent of the default locale
    // (for example, "TITLE" would not become "title" under a Turkish locale).
    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
    public static void main(String[] args)
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc)); 
    }

}

Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>

Another approach similar to the above is to instead make a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extending DocumentImpl" approach, but this way is "cleaner" as we don't have to care about particular Document implementations. The extra code for this approach isn't difficult; it's just a bit tedious to provide all those wrapper implementations for the Document methods. I haven't completely worked this out yet and there may be some problems, but if it works, this is the general idea:

public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //...
}
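
If writing out every delegating method by hand is too tedious, one alternative is to let java.lang.reflect.Proxy generate the wrapper. To be clear, this swaps in a different technique from the hand-written class sketched above. The class name LowercasingDocumentProxy is made up for the example, and I have only tried the idea against the JDK's built-in DOM.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper class, not part of any library.
class LowercasingDocumentProxy {

    // Returns a Document that lowercases tag names on createElement and
    // getElementsByTagName and forwards every other call unchanged.
    static Document wrap(Document target) {
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            boolean takesTagName = name.equals("createElement")
                    || name.equals("getElementsByTagName");
            if (takesTagName && args != null && args[0] instanceof String) {
                args[0] = ((String) args[0]).toLowerCase(Locale.ROOT);
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the original exception, e.g. a DOMException
            }
        };
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class },
                handler);
    }

    public static void main(String[] args) throws Exception {
        Document doc = wrap(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        Element root = doc.createElement("HTML");
        doc.appendChild(root);
        root.appendChild(doc.createElement("Body"));
        System.out.println(doc.getElementsByTagName("BODY").getLength());
    }
}
```

The same caveats apply as for a hand-written wrapper. Elements still report the wrapped Document as their owner, Element#getElementsByTagName is not intercepted, and code that casts the Document to an implementation class (some serializers do) will probably not accept the proxy.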

Whether it's org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will help give you an idea of how to proceed.

Edit:

In my parsing tests, Xerces would hang in an entity management class while trying to open an HTTP connection when parsing the following XHTML file. I don't know why, especially since I tested on a local HTML file with no entities. (Maybe it has something to do with the DOCTYPE or namespace?) This is the document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>
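
A plausible explanation, which I did not confirm in those tests, is that the parser tries to download the external DTD named in the DOCTYPE from www.w3.org even when the document uses no entities. Under that assumption, one way to keep parsing offline is to install an EntityResolver that answers every external entity request with empty content. The class and method names below are made up for the sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of any library.
class OfflineXhtmlParsing {

    static final String XHTML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
            + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "    <head><title>My Title</title></head>\n"
            + "    <body><p id=\"anId1\">Here is some text1.</p></body>\n"
            + "</html>";

    // Parses the markup without letting the parser fetch the external DTD.
    static Document parseWithoutFetchingDtd(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // Resolve every external entity (including the DTD) to an empty stream.
        builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseWithoutFetchingDtd(XHTML);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

The trade-off is that entities defined only in the DTD, such as &nbsp;, are no longer known to the parser, so a document that uses them would fail to parse with this resolver.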


answered Oct 04 '22 10:10

dbank
