
XPath normalize-space() to return a sequence of normalized strings

Tags: java, xml, xpath

I need to use the XPath function normalize-space() to normalize the text I want to extract from an XHTML document: http://test.anahnarciso.com/clean_bigbook_0.html

I'm using the following expression:

//*[@slot="address"]/normalize-space(.)

Which works perfectly in Qizx Studio, the tool I use to test XPath expressions.

    let $doc := doc('http://test.anahnarciso.com/clean_bigbook_0.html')
    return $doc//*[@slot="address"]/normalize-space(.)

This simple query returns a sequence of xs:string.

144 Hempstead Tpke
403 West St
880 Old Country Rd
8412 164th St
8412 164th St
1 Irving Pl
1622 McDonald Ave
255 Conklin Ave
22011 Hempstead Ave
7909 Queens Blvd
11820 Queens Blvd
1027 Atlantic Ave
1068 Utica Ave
1002 Clintonville St
1002 Clintonville St
1156 Hempstead Tpke
Route 49
10007 Rockaway Blvd
12694 Willets Point Blvd
343 James St

Now, I want to use the previous expression in my Java code.

String exp = "//*[@slot=\"address\"]/normalize-space(.)";
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr = xpath.compile(exp);
Object result = expr.evaluate(doc, XPathConstants.NODESET);

But the last line throws an Exception:

Cannot convert XPath value to Java object: required class is org.w3c.dom.NodeList; supplied value has type xs:string

Obviously, I should replace XPathConstants.NODESET with something else. I tried XPathConstants.STRING, but it only returns the first element of the sequence.

How can I obtain something like an array of Strings?

Thanks in advance.

Asked by anahnarciso, Jul 07 '12 20:07


1 Answer

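Your expression works in XPath 2.0, but is illegal in XPath 1.0 (which is what the JDK's built-in javax.xml.xpath implementation supports), because a function call such as normalize-space(.) cannot appear as a path step there. The XPath 1.0 form would be normalize-space(//*[@slot='address']).

However, in XPath 1.0, when normalize-space() is called on a node-set, only the first node (in document order) is used, so that form still returns a single string.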
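
To illustrate the first-node behaviour, here is a minimal sketch using the JDK's built-in XPath 1.0 engine. The class name, the helper method, and the two-element sample markup are hypothetical stand-ins for the real page; only the javax.xml.xpath calls are standard.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class FirstNodeOnly {
    // Hypothetical sample markup (assumption): two elements with slot="address".
    static final String SAMPLE =
        "<root>"
        + "<span slot='address'>  144   Hempstead Tpke </span>"
        + "<span slot='address'> 403 West   St</span>"
        + "</root>";

    // Evaluates the XPath 1.0 form of the expression and returns its string result.
    static String firstNormalized(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The node-set argument is converted to the string-value of its
            // first node in document order before being normalized.
            return (String) xpath.evaluate(
                "normalize-space(//*[@slot='address'])",
                new InputSource(new StringReader(xml)),
                XPathConstants.STRING);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prints "144 Hempstead Tpke"; the second address is silently ignored.
        System.out.println(firstNormalized(SAMPLE));
    }
}
```

This matches what the question observed with XPathConstants.STRING: one string comes back, no matter how many nodes match.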

To get a normalized string for every node, you'll need either an XPath 2.0-compatible processor, or to traverse the resulting node-set and call normalize-space() on each node:

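import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;

// 'input' is the already-parsed document (for example an org.w3c.dom.Document).
XPath xpath = XPathFactory.newInstance().newXPath();
XPathExpression expr;

// Step 1: select the matching elements as a node-set (valid XPath 1.0).
String select = "//*[@slot='address']";
expr = xpath.compile(select);
NodeList result = (NodeList) expr.evaluate(input, XPathConstants.NODESET);

// Step 2: compile normalize-space(.) once and reuse it for every node.
String normalize = "normalize-space(.)";
expr = xpath.compile(normalize);

// Evaluate the expression with each selected node as the context node.
int length = result.getLength();
for (int i = 0; i < length; i++) {
    System.out.println(expr.evaluate(result.item(i), XPathConstants.STRING));
}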

This prints exactly the output you listed.
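
Since the question asks for something like an array of Strings, the same loop can collect the values instead of printing them. The following is a sketch under stated assumptions: the class name, the normalizedStrings helper, and the inline sample markup are hypothetical, and the document is parsed from a string here only to keep the example self-contained.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class NormalizedAddresses {
    // Hypothetical sample markup (assumption) standing in for the real page.
    static final String SAMPLE =
        "<root>"
        + "<span slot='address'>  144   Hempstead Tpke </span>"
        + "<span slot='address'> 403 West   St</span>"
        + "</root>";

    // Returns normalize-space(.) for every node matched by selectExpr.
    static List<String> normalizedStrings(String xml, String selectExpr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // Select the nodes first (plain XPath 1.0 node-set).
            NodeList nodes = (NodeList) xpath.evaluate(
                selectExpr, new InputSource(new StringReader(xml)),
                XPathConstants.NODESET);
            // Then normalize each node's string-value individually.
            XPathExpression normalize = xpath.compile("normalize-space(.)");
            List<String> out = new ArrayList<>(nodes.getLength());
            for (int i = 0; i < nodes.getLength(); i++) {
                out.add(normalize.evaluate(nodes.item(i)));
            }
            return out;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> addresses = normalizedStrings(SAMPLE, "//*[@slot='address']");
        // Convert to a plain array if an array is what the caller needs.
        String[] asArray = addresses.toArray(new String[0]);
        System.out.println(String.join("\n", asArray));
    }
}
```

The single-argument XPathExpression.evaluate(Object) overload returns a String directly, which avoids the cast needed with XPathConstants.STRING.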

Answered by Petr Janeček, Sep 17 '22 23:09