I have a file of roughly the following shape:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name='ocr-system' content='tesseract 3.02' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "D:\DPC2\converted\60\60.tiff"; bbox 0 0 2479 3508; ppageno 0'>
<!-- LOTS OF CONTENT -->
</div>
</body>
</html>
Then I am using JDOM 2.x with the following XPath query:
// htmlFile is an input variable of type java.nio.file.Path
Document document = xmlBuilder.build(htmlFile.toFile());
XPathFactory factory = XPathFactory.instance();
XPathExpression<Element> xpePages =
factory.compile("//html/body/div[@class='ocr_page']", Filters.element());
List<Element> pages = xpePages.evaluate(document);
But it never finds any elements. What am I doing wrong in the query?
Namespaces.
The xmlns="http://www.w3.org/1999/xhtml" means that elements with no prefix in the XML file are actually in the http://www.w3.org/1999/xhtml namespace, and you need to specify this in the XPath expression using a prefix:
XPathExpression<Element> xpePages =
factory.compile("/h:html/h:body/h:div[@class='ocr_page']",
Filters.element(),
null, // no variables
Namespace.getNamespace("h", "http://www.w3.org/1999/xhtml"));
You must use a prefix, because in XPath 1.0 an unprefixed name always means no namespace.
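To see the prefix rule in isolation, here is a minimal sketch that uses the JDK's built-in javax.xml.xpath API instead of JDOM, so it runs without extra dependencies. The class name PrefixedXPathDemo, the helper count, and the cut-down sample document (DOCTYPE omitted) are assumptions made for illustration, not part of the original code.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PrefixedXPathDemo {
    static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical stand-in for the hOCR file in the question.
    static final String SAMPLE =
        "<html xmlns='http://www.w3.org/1999/xhtml'><body>"
      + "<div class='ocr_page' id='page_1'/>"
      + "</body></html>";

    // Returns how many nodes the expression selects in SAMPLE.
    public static int count(String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // without this the parser ignores xmlns
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(SAMPLE)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the prefix "h" to the XHTML namespace, mirroring the
        // Namespace.getNamespace("h", ...) argument in the JDOM call.
        xpath.setNamespaceContext(new NamespaceContext() {
            @Override
            public String getNamespaceURI(String prefix) {
                return "h".equals(prefix) ? XHTML_NS : XMLConstants.NULL_NS_URI;
            }
            @Override
            public String getPrefix(String uri) {
                return XHTML_NS.equals(uri) ? "h" : null;
            }
            @Override
            public Iterator<String> getPrefixes(String uri) {
                return XHTML_NS.equals(uri)
                    ? Collections.singletonList("h").iterator()
                    : Collections.<String>emptyIterator();
            }
        });

        NodeList nodes =
            (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        // Unprefixed names mean "no namespace", so nothing matches.
        System.out.println(count("/html/body/div[@class='ocr_page']"));       // 0
        // Prefixed names resolve to the XHTML namespace and match.
        System.out.println(count("/h:html/h:body/h:div[@class='ocr_page']")); // 1
    }
}
```

The prefix h is arbitrary; only the namespace URI it is bound to has to match the one declared in the document.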
<html xmlns="http://www.w3.org/1999/xhtml"
means that elements like html are in the namespace http://www.w3.org/1999/xhtml.
You've got a couple of ways forward:
Use a NamespaceContext (roughly a namespace manager), which looks rather onerous in this technology stack: https://stackoverflow.com/a/6390494/314291
Or match on local-name() and namespace-uri() instead of using prefixes:
//*[local-name()='html' and namespace-uri()='http://www.w3.org/1999/xhtml']
/*[local-name()='body' and namespace-uri()='http://www.w3.org/1999/xhtml']
/* ... etc.
If you are confident that there is no conflict in the namespaces of the elements, you can choose to use just local-name():
//*[local-name()='html']/*[local-name()='body']/* ...
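As a rough illustration of the prefix-free approach, the sketch below evaluates both the strict form (local-name() plus namespace-uri()) and the loose form (local-name() only) with the JDK's javax.xml.xpath API. The class name LocalNameXPathDemo, the helper count, and the sample document are hypothetical, and the JDK API is used here only so the example runs without JDOM on the classpath.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class LocalNameXPathDemo {
    static final String NS = "http://www.w3.org/1999/xhtml";

    // Hypothetical two-page document in the XHTML namespace.
    static final String SAMPLE =
        "<html xmlns='" + NS + "'><body>"
      + "<div class='ocr_page' id='page_1'/>"
      + "<div class='ocr_page' id='page_2'/>"
      + "<div class='other'/>"
      + "</body></html>";

    // Strict form: checks both the local name and the namespace URI.
    static String step(String name) {
        return "/*[local-name()='" + name + "' and namespace-uri()='" + NS + "']";
    }
    static final String STRICT =
        step("html") + step("body") + step("div") + "[@class='ocr_page']";

    // Loose form: checks the local name only, so it matches any namespace.
    static final String LOOSE =
        "/*[local-name()='html']/*[local-name()='body']"
      + "/*[local-name()='div'][@class='ocr_page']";

    // Returns how many nodes the expression selects in SAMPLE.
    public static int count(String expression) throws Exception {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true); // needed for local-name()/namespace-uri()
        Document doc = dbf.newDocumentBuilder()
                          .parse(new InputSource(new StringReader(SAMPLE)));
        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate(expression, doc, XPathConstants.NODESET);
        return nodes.getLength();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(count(STRICT));           // 2
        System.out.println(count(LOOSE));            // 2
        System.out.println(count("/html/body/div")); // 0, names are unprefixed
    }
}
```

No namespace binding is needed for either form, at the cost of much longer expressions; the loose form would also match same-named elements from any other namespace.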