I am trying to get an HTML node from a file so that I can later count all of its descendants. I am having trouble retrieving the element from the DOM. Here are the steps I have taken so far.
First, here is my HTML:
<html>
  <head>
    <title></title>
  </head>
  <body>
    <div id="container">
      <a></a>
      <div id="header">
        <div id="firstchild">
          <div>
            <img></img>
          </div>
          <a></a>
          <ul>
            <li>
              <a>Inbox</a>
            </li>
            <li>
              <a>Logout</a>
            </li>
          </ul>
          <form></form>
        </div>
        <div id="nextsibling"></div>
      </div>
    </div>
  </body>
</html>
Second, I built this function, which parses the file and returns it as a Document.
public static Document buildDocument(String file) {
    try {
        DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
        Document document = docBuilder.parse(file);
        return document;
    } catch (ParserConfigurationException | SAXException | IOException ex) {
        System.out.println("the exception is: " + ex.toString());
    }
    return null;
}
Next, in my main method, I tried to assign a document element to a Node object using getElementById, like this:
Document doc = buildDocument("myHTMLFile");
org.w3c.dom.Node node = doc.getElementById("header"); // the id of an HTML element
Correct me if I am wrong, but this should retrieve the node. However, it returns null, and I do not understand why. Note that when I debug the code, the document does contain all of the correct data as far as I can tell.
You are using it incorrectly. The Javadoc of getElementById says:
Returns the Element that has an ID attribute with the given value. If no such element exists, this returns null. ... The DOM implementation is expected to use the attribute Attr.isId to determine if an attribute is of type ID. Note: Attributes with the name "ID" or "id" are not of type ID unless so defined.
In your case the best solution is to use XPath (a simple query language for XML):
XPath xpath = XPathFactory.newInstance().newXPath();
Node node = (Node) xpath.evaluate("//*[@id='header']", document, XPathConstants.NODE);
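The expression //*[@id='header'] matches every element in the document whose id attribute has the value 'header'. Because the call asks for XPathConstants.NODE, evaluate returns only the first match in document order.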
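Since the question also wants to count the descendants of the node, here is a minimal, self-contained sketch that combines the XPath lookup above with a recursive count. The class name XPathIdLookup, the helpers parse, findById and countDescendantElements, and the shortened inline markup are all made up for illustration. It counts element nodes only, which is an assumption about what "descendants" should mean here.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XPathIdLookup {

    // Hypothetical helper: parses well-formed markup held in a string.
    static Document parse(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse markup", ex);
        }
    }

    // Returns the first element whose id attribute equals the given value, or null.
    // Assumption: the id contains no single quote, since it is pasted into the expression.
    static Node findById(Document document, String id) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return (Node) xpath.evaluate("//*[@id='" + id + "']", document, XPathConstants.NODE);
        } catch (Exception ex) {
            throw new IllegalStateException("could not evaluate XPath", ex);
        }
    }

    // Counts all element descendants of the node (text and comment nodes are skipped).
    static int countDescendantElements(Node node) {
        int count = 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                count += 1 + countDescendantElements(child);
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML in the question.
        String markup = "<div id=\"container\"><div id=\"header\">"
                + "<div id=\"firstchild\"><a/><ul><li/></ul></div>"
                + "<div id=\"nextsibling\"/></div></div>";
        Document document = parse(markup);
        Node header = findById(document, "header");
        System.out.println("found: " + header.getNodeName());
        System.out.println("descendant elements: " + countDescendantElements(header));
    }
}
```

With a real file you would keep your own buildDocument and pass its result to findById. Check the result for null before counting, because evaluate returns null when nothing matches.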
It appears you are working with the generic XML DOM. XML expects ID attributes to be declared as such, so getElementById will not find an element by an attribute that is merely named "id" unless that attribute has been designated as an ID type.
Try finding an HTML-specific interface or adding a DOCTYPE which defines the id attribute as an ID type. (I wouldn't recommend the latter though because HTML5 has moved away from attempting an XHTML compatible approach even if it technically supports an XHTML serialization.) See Parse Web Site HTML with JAVA for recommendations on HTML-specific parsers.
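A third option, not among the suggestions above, stays within the standard org.w3c.dom API: walk the parsed tree and call Element.setIdAttribute on every id attribute, which marks it as an ID type so that getElementById can see it. The sketch below is only an illustration. The class name IdAttributeDemo and the helpers parse and markIdAttributes are hypothetical, and the behaviour was reasoned about for the JDK's default parser, so other DOM implementations may differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class IdAttributeDemo {

    // Hypothetical helper: parses well-formed markup held in a string.
    static Document parse(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(markup)));
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse markup", ex);
        }
    }

    // Recursively flags every attribute named "id" as an ID-type attribute.
    static void markIdAttributes(Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            Element element = (Element) node;
            if (element.hasAttribute("id")) {
                element.setIdAttribute("id", true);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            markIdAttributes(child);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the HTML in the question.
        Document document = parse("<div id=\"container\"><div id=\"header\"/></div>");

        // No DTD declares "id" as an ID, so this lookup is expected to find nothing.
        System.out.println("before: " + document.getElementById("header"));

        markIdAttributes(document.getDocumentElement());

        // After marking, the same lookup should return the element.
        Element header = document.getElementById("header");
        System.out.println("after: " + (header == null ? null : header.getNodeName()));
    }
}
```

The trade-off is one full pass over the tree before any lookup, in exchange for leaving existing getElementById calls unchanged.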