The documentation of the Document interface describes the interface as:
The Document interface represents the entire HTML or XML document.
javax.xml.parsers.DocumentBuilder builds XML Documents. However, I am unable to find a way to build a Document that is an HTML Document!
I want an HTML Document because I am trying to build a document that I then pass to a library that is expecting an HTML Document. This library uses Document#getElementsByTagName(String tagname) in a non case-sensitive manner, which is fine for HTML, but not for XML.
I've looked around, and am not finding anything. Items like How to convert an Html source of a webpage into org.w3c.dom.Document in java? don't actually have an answer.
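To make the problem concrete, here is a minimal sketch of the case-sensitive lookup on a Document built by a standard DocumentBuilder. It assumes the JDK's default parser implementation; the class name CaseSensitiveLookupDemo and the helper countMatches are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CaseSensitiveLookupDemo {

    /** Counts the elements matching tagName in a tiny document holding one lowercase body element. */
    static int countMatches(String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            html.appendChild(doc.createElement("body"));
            // XML DOM lookups compare tag names exactly, including case
            return doc.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countMatches("body")); // same case as the element
        System.out.println(countMatches("BODY")); // different case
    }
}
```

The exact-case lookup finds the element, while the uppercase lookup finds nothing, which is what breaks a library that expects HTML-style matching.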
Package org.w3c.dom Description. Provides the interfaces for the Document Object Model (DOM) which is a component API of the Java API for XML Processing. The Document Object Model Level 2 Core API allows programs to dynamically access and update the content and structure of documents.
The DOM is separated into three parts: Core, HTML, and XML.
You seem to have two explicit requirements:
1. You need to represent the HTML as an org.w3c.dom.Document.
2. You need Document#getElementsByTagName(String tagname) to operate in a case-insensitive manner.
If you are trying to work with HTML using org.w3c.dom.Document, then I assume you are working with some flavor of XHTML, because an XML API, such as DOM, is going to expect well-formed XML. HTML isn't necessarily well-formed XML, but XHTML is. Even if you were working with HTML, you would have to do some pre-processing to ensure it is well-formed XML before trying to run it through an XML parser. It might just be easier to parse the HTML first with an HTML parser, such as jsoup, and then build your org.w3c.dom.Document by walking through the HTML parser's produced tree (org.jsoup.nodes.Document in the case of jsoup).
There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-j (2.11.0) in the form of org.apache.html.dom.HTMLDocumentImpl. At first this seems promising; however, upon closer examination, we find that there are some issues.
1. There is not a clear, "clean" way to obtain an instance of an object implementing the org.w3c.dom.html.HTMLDocument interface.
With Xerces we normally would obtain a Document object using a DocumentBuilder in the following fashion:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file
Or using the DOMImplementation variety:
DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");
In both cases, we are purely using org.w3c.dom.* interfaces to obtain the Document object.
The closest equivalent I found for HTMLDocument was something like this:
HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");
This requires us to directly instantiate internal implementation classes, making us implementation dependent on Xerces.
(Note: I also saw that Xerces had an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)
2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.
Although you can search the HTMLDocument tree using getElementsByTagName(String tagname) in a case-insensitive manner, all of the element names are saved internally in ALL CAPS. But XHTML element and attribute names are supposed to be in all lowercase. (This could be worked around by walking the entire document tree and using Document's renameNode() method to change all of the elements' names to lowercase.)
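The renameNode() workaround could look roughly like the sketch below. It is only an illustration under assumptions: it runs against a plain Document from the JDK's default DocumentBuilder (standing in for an HTMLDocument with uppercase names), and the class and helper names are hypothetical.

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class LowercaseElementNames {

    /** Renames node (if it is an element) and every element below it to a lowercase name. */
    static void lowercaseElements(Document doc, Node node) {
        Node current = node;
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String lower = node.getNodeName().toLowerCase(Locale.ROOT);
            // renameNode may hand back a replacement node, so keep working with its result
            current = doc.renameNode(node, node.getNamespaceURI(), lower);
        }
        Node child = current.getFirstChild();
        while (child != null) {
            // remember the sibling first, in case the child gets replaced during renaming
            Node next = child.getNextSibling();
            lowercaseElements(doc, child);
            child = next;
        }
    }

    /** Builds a stand-in document whose element names are all uppercase. */
    static Document buildUppercaseDoc() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            doc.appendChild(html);
            Element body = doc.createElement("BODY");
            html.appendChild(body);
            body.appendChild(doc.createElement("P"));
            body.appendChild(doc.createElement("P"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Lowercases the stand-in document and counts the elements matching tagName. */
    static int countAfterLowercasing(String tagName) {
        Document doc = buildUppercaseDoc();
        lowercaseElements(doc, doc.getDocumentElement());
        return doc.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        System.out.println(countAfterLowercasing("p"));
        System.out.println(countAfterLowercasing("P"));
    }
}
```

Note that this lowercases the whole qualified name, so any namespace prefix would be lowercased as well.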
Additionally, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There doesn't seem to be a straightforward way to set those in an HTMLDocument (unless you do some fiddling with internal Xerces implementations).
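For comparison, a plain org.w3c.dom.Document can get both through the standard DOMImplementation interface. The following is a minimal sketch, assuming the JDK's default DocumentBuilderFactory; the class name XhtmlSkeleton and its method are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class XhtmlSkeleton {

    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Creates an empty document with an XHTML 1.0 Strict doctype and a namespaced html root. */
    static Document newXhtmlDocument() {
        try {
            DOMImplementation impl =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation();
            DocumentType docType = impl.createDocumentType(
                    "html",
                    "-//W3C//DTD XHTML 1.0 Strict//EN",
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
            // The root element is created in the XHTML namespace and the doctype is attached
            return impl.createDocument(XHTML_NS, "html", docType);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newXhtmlDocument();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```

This only uses org.w3c.dom and javax.xml.parsers types, so it does not depend on Xerces internals, but it also does nothing about case-insensitive lookups.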
3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface seems incomplete.
I didn't scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs, and comments in the source code of the Xerces internal implementation. In those comments, I also found notes that several different parts of the interface weren't implemented. (Sidenote: I really got the impression that the org.w3c.dom.html.HTMLDocument interface itself isn't really used by anyone and perhaps is incomplete itself.)
For those reasons, I think it's better to avoid org.w3c.dom.html.HTMLDocument and just do what we can with org.w3c.dom.Document. What can we do?
Well, one approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach doesn't require much code, but it still makes us implementation dependent on Xerces since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we are just converting all tag names to lowercase on element creation and searches. This will allow use of Document#getElementsByTagName(String tagname) in a case-insensitive manner.
MyHTMLDocumentImpl:
import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;

    /**
     * Creates a Document with basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *   <head>
     *     <title>My Title</title>
     *   </head>
     *   <body/>
     * </html>
     * }
     * </pre>
     *
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document.
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);
        return htmlDoc;
    }

    /**
     * This method will allow us to create our
     * MyHTMLDocumentImpl from an existing Document.
     */
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            //copy the tree, then hand ownership of the copy to the new document
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase());
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase());
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase());
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase());
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase());
    }
}
Tester:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;

public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
    public static void main(String[] args)
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {
        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());
        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }
        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }
        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");
        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true); //if you want it pretty and indented
        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc));
    }
}
Output:
My Title
Here is some text1.
Here is some text2.
Here is some text3.
Here is some text1.
Here is some text2.
Here is some text3.
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My Title</title>
</head>
<body>
<p id="anId1">Here is some text1.</p>
<p id="anId2">Here is some text2.</p>
<p id="anId3">Here is some text3.</p>
</body>
</html>
Another approach similar to the above is to instead make a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extending DocumentImpl" approach, but this way is "cleaner" as we don't have to care about particular Document implementations. The extra code for this approach isn't difficult; it's just a bit tedious to provide all those wrapper implementations for the Document methods. I haven't completely worked this out yet and there may be some problems, but if it works, this is the general idea:
public class MyHTMLDocumentWrapper implements Document {
    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }
    //...
}
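If writing out every delegating method by hand is too tedious, a related variation is a java.lang.reflect.Proxy that forwards every Document call to the wrapped object and lowercases the tag name only where needed. This is a different technique from the handwritten wrapper above, shown only as a sketch under assumptions: it uses the JDK's default DocumentBuilder, intercepts just createElement and getElementsByTagName, and the class and method names are hypothetical.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

public class LowercasingDocumentProxy {

    /** Returns a Document view of delegate that lowercases tag names on creation and lookup. */
    static Document wrap(Document delegate) {
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class },
                (proxy, method, args) -> {
                    String name = method.getName();
                    if (name.equals("createElement") || name.equals("getElementsByTagName")) {
                        // both methods take the tag name as their single String argument
                        args[0] = ((String) args[0]).toLowerCase(Locale.ROOT);
                    }
                    try {
                        // everything else is forwarded unchanged to the wrapped Document
                        return method.invoke(delegate, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the original DOMException and friends
                    }
                });
    }

    /** Creates a BODY element through the proxy and counts lookups for tagName. */
    static int countBodyMatches(String tagName) {
        try {
            Document doc = wrap(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
            doc.appendChild(doc.createElement("BODY")); // stored as "body"
            return doc.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countBodyMatches("bOdY"));
    }
}
```

The same caveats as for the handwritten wrapper apply, and probably a few more: Element#getElementsByTagName is not intercepted, nodes still report the wrapped Document as their owner, and implementation code that casts the Document to a concrete class may reject the proxy.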
Whether it's org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will help give you an idea of how to proceed.
Edit:
In my parsing tests, while trying to parse the following XHTML file, Xerces would hang down in an entity management class trying to open an HTTP connection. Why, I don't know, especially since I tested on a local HTML file with no entities. (Maybe it has something to do with the DOCTYPE or namespace?) This is the document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC
"-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>My Title</title>
</head>
<body>
<p id="anId1">Here is some text1.</p>
<p id="anId2">Here is some text2.</p>
<p id="anId3">Here is some text3.</p>
</body>
</html>
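One plausible cause (an assumption, not something confirmed by the tests above) is that the parser tries to download the external DTD named in the DOCTYPE over HTTP, even when the document itself is a local file. A minimal sketch of one way to rule that out with standard JAXP is to install an EntityResolver that answers every external entity request with empty content, so nothing is fetched. The class name OfflineXhtmlParse and its members are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class OfflineXhtmlParse {

    // Same structure as the XHTML file above, inlined so the sketch is self-contained
    static final String XHTML =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
            + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
            + "  \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
            + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
            + "<head><title>My Title</title></head>\n"
            + "<body>\n"
            + "<p id=\"anId1\">Here is some text1.</p>\n"
            + "<p id=\"anId2\">Here is some text2.</p>\n"
            + "<p id=\"anId3\">Here is some text3.</p>\n"
            + "</body>\n"
            + "</html>\n";

    /** Parses xml while answering every external entity lookup with an empty stream. */
    static Document parseWithoutNetwork(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The DTD request is resolved locally to empty content instead of going to w3.org
            builder.setEntityResolver((publicId, systemId) -> new InputSource(new StringReader("")));
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseWithoutNetwork(XHTML);
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```

The trade-off is that an empty DTD defines no entities, so a document using named entities such as &nbsp; would fail to parse; in that case the resolver would need to serve a local copy of the DTD instead.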