
How can I build an HTML org.w3c.dom.Document?

Tags: java, html, dom, xml

The documentation of the Document interface describes the interface as:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I am unable to find a way to build a Document that is an HTML Document!

I want an HTML Document because I am building a document that I then pass to a library that expects an HTML Document. This library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML, but not for XML.

I've looked around and haven't found anything. Questions like "How to convert an Html source of a webpage into org.w3c.dom.Document in java?" don't actually have an answer.

Asked by Dmitry Minkovsky on Mar 13 '15 at 21:03




1 Answer

You seem to have two explicit requirements:

  1. You need to represent HTML as a org.w3c.dom.Document.
  2. You need Document#getElementsByTagName(String tagname) to operate in a case-insensitive manner.
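
To make requirement 2 concrete, here is a minimal sketch that uses only the JDK's built-in parser. The class name CaseSensitivityDemo and the helper countMatches are illustrative names, not part of any library. It shows that a plain XML Document matches tag names exactly as written:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical demo class: an XML DOM compares tag names case-sensitively.
final class CaseSensitivityDemo {

    // Builds <html><body/></html> and counts the elements matching tagName.
    static int countMatches(String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            html.appendChild(doc.createElement("body"));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches("body")); // prints 1: exact case matches
        System.out.println(countMatches("BODY")); // prints 0: different case does not
    }
}
```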

If you are trying to work with HTML using org.w3c.dom.Document, then I assume you are working with some flavor of XHTML, because an XML API such as DOM is going to expect well-formed XML. HTML isn't necessarily well-formed XML, but XHTML is. Even if you were working with HTML, you would have to do some pre-processing to ensure it is well-formed XML before running it through an XML parser. It might be easier to parse the HTML first with an HTML parser, such as jsoup, and then build your org.w3c.dom.Document by walking the tree the HTML parser produces (org.jsoup.nodes.Document in the case of jsoup).
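
As a small illustration of the well-formedness point, the sketch below feeds two strings to the JDK's XML parser. WellFormedCheck and parsesAsXml are made-up names for this example; the only assumption is that ordinary HTML with an unclosed <br> is representative of what an XML parser rejects.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks whether markup is acceptable to an XML parser.
final class WellFormedCheck {

    // Returns true if the markup parses as XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler ignores warnings and rethrows fatal errors.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Valid HTML, but <br> is never closed, so it is not well-formed XML.
        System.out.println(parsesAsXml("<html><body><p>one<br>two</p></body></html>"));  // false
        // The XHTML spelling closes the element and parses fine.
        System.out.println(parsesAsXml("<html><body><p>one<br/>two</p></body></html>")); // true
    }
}
```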


There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-J (2.11.0), in the form of org.apache.html.dom.HTMLDocumentImpl. At first this seems promising, but on closer examination there are some issues.

1. There is not a clear, "clean" way to obtain an instance of an object implementing the org.w3c.dom.html.HTMLDocument interface.

With Xerces we normally would obtain a Document object using a DocumentBuilder in the following fashion:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using a DOMImplementation variety:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases, we are using only org.w3c.dom.* interfaces to obtain the Document object.

The closest equivalent I found for HTMLDocument was something like this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

This requires us to reference internal implementation classes directly, which makes us implementation-dependent on Xerces.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't look into it.)

2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

Although you can search the HTMLDocument tree using getElementsByTagName(String tagname) in a case-insensitive manner, all of the element names are stored internally in ALL CAPS, whereas XHTML element and attribute names are supposed to be all lowercase. (This could be worked around by walking the entire document tree and using Document's renameNode() method to change every element's name to lowercase.)

Additionally, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set those in an HTMLDocument (unless you do some fiddling with internal Xerces implementations).
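
The renameNode() workaround mentioned under this point could look roughly like the sketch below. It is shown on a plain org.w3c.dom.Document built with the JDK parser and has not been checked against Xerces' HTMLDocumentImpl, whose own name handling may interfere. LowercaseRenamer, lowercaseElementNames and upperCaseSample are illustrative names.

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper: renames every element below a node to lowercase.
final class LowercaseRenamer {

    static void lowercaseElementNames(Document doc, Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the sibling first: renameNode may replace the child node.
            Node next = child.getNextSibling();
            lowercaseElementNames(doc, child);
            child = next;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            String lower = name.toLowerCase(Locale.ROOT);
            if (!name.equals(lower)) {
                doc.renameNode(node, node.getNamespaceURI(), lower);
            }
        }
    }

    // Builds <HTML><BODY><P/><P/></BODY></HTML> as test input.
    static Document upperCaseSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            doc.appendChild(html);
            html.appendChild(body);
            body.appendChild(doc.createElement("P"));
            body.appendChild(doc.createElement("P"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Calling lowercaseElementNames(doc, doc) on the sample makes a later doc.getElementsByTagName("p") find both paragraphs.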

3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface seems incomplete.

I didn't scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs, and comments in the source code of the Xerces internal implementation. In those comments, I also found notes that several different parts of the interface weren't implemented. (Sidenote: I really got the impression that the org.w3c.dom.html.HTMLDocument interface itself isn't really used by anyone and perhaps is incomplete itself.)


For those reasons, I think it's better to avoid org.w3c.dom.html.HTMLDocument and just do what we can with org.w3c.dom.Document. What can we do?

Well, one approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach doesn't require much code, but it still makes us implementation-dependent on Xerces, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we just convert all tag names to lowercase on element creation and on searches. This allows Document#getElementsByTagName(String tagname) to be used in a case-insensitive manner.

MyHTMLDocumentImpl:

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

    /**
     * This method allows us to create a
     * MyHTMLDocumentImpl from an existing Document.
     * Element names are copied as-is, so the source document
     * should already use lowercase tag names.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    // Lowercasing uses Locale.ROOT so the result does not depend on the JVM's
    // default locale (e.g. "TITLE" becomes "tıtle" under a Turkish locale).
    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    public static void main(String[] args) //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc)); 
    }

}

Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>

Another approach similar to the above is to instead make a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extending DocumentImpl" approach, but this way is "cleaner" as we don't have to care about particular Document implementations. The extra code for this approach isn't difficult; it's just a bit tedious to provide all those wrapper implementations for the Document methods. I haven't completely worked this out yet and there may be some problems, but if it works, this is the general idea:

public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //...
}
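
Writing every delegating method by hand is the tedious part of the wrapper idea. As a shortcut, and this is a substitute technique rather than the hand-written wrapper outlined above, a java.lang.reflect.Proxy can implement the Document interface and forward everything to the wrapped instance, lowercasing only the tag-name arguments. The sketch assumes that intercepting createElement and getElementsByTagName is enough for the use case; LowercasingDocumentProxy, wrap and sample are illustrative names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: a dynamic proxy standing in for a hand-written wrapper.
final class LowercasingDocumentProxy {

    // Returns a Document that forwards every call to target, lowercasing tag names.
    static Document wrap(Document target) {
        InvocationHandler handler = (proxy, method, args) -> {
            Object[] actualArgs = args;
            String name = method.getName();
            boolean takesTagName = name.equals("createElement")
                    || name.equals("getElementsByTagName");
            if (takesTagName && args != null && args.length == 1
                    && args[0] instanceof String) {
                actualArgs = new Object[] {((String) args[0]).toLowerCase(Locale.ROOT)};
            }
            try {
                return method.invoke(target, actualArgs);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the original exception, e.g. DOMException
            }
        };
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] {Document.class},
                handler);
    }

    // Builds <html><body/></html> through the proxy, using mixed-case names.
    static Document sample() {
        try {
            Document wrapped = wrap(DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument());
            Element html = wrapped.createElement("HTML");
            wrapped.appendChild(html);
            html.appendChild(wrapped.createElement("Body"));
            return wrapped;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sample().getElementsByTagName("BODY").getLength()); // prints 1
    }
}
```

Limitations to keep in mind: nodes created through the proxy still report the wrapped document from getOwnerDocument(), Element#getElementsByTagName is not intercepted, and the proxy itself cannot be passed to methods of the underlying implementation that expect one of its own node classes.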

Whether it's org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will help give you an idea of how to proceed.


Edit:

In my parsing tests, Xerces would hang in an entity-management class while trying to open an HTTP connection when parsing the following XHTML file. I don't know why, especially since I tested on a local HTML file with no entities. (Maybe it has something to do with the DOCTYPE or namespace?) This is the document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>
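
A possible explanation for the hang described above, offered as an assumption rather than something verified against this setup: by default Xerces loads the external DTD named in the DOCTYPE even when it is not validating, so the parser tries to download xhtml1-strict.dtd from www.w3.org. The sketch below switches that off with the Xerces feature load-external-dtd, which the JDK's built-in parser also accepts. OfflineXhtmlParser, parse and SAMPLE are illustrative names.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: parses XHTML without fetching the DTD over the network.
final class OfflineXhtmlParser {

    // Xerces feature id; parsers that are not Xerces-based may reject it.
    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    // A shortened version of the document shown above.
    static final String SAMPLE =
            "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
            + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
            + "<head><title>My Title</title></head>"
            + "<body><p id=\"anId1\">Here is some text1.</p></body>"
            + "</html>";

    static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setFeature(LOAD_EXTERNAL_DTD, false); // do not fetch the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(doc.getDocumentElement().getNodeName()); // prints html
    }
}
```

An alternative that relies only on standard API is to call DocumentBuilder#setEntityResolver with a resolver that returns an empty InputSource, so the parser reads an empty DTD instead of opening a connection.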

Answered by dbank on Oct 04 '22 at 10:10