I am writing an HTML parser, which uses TagSoup to pass a well-formed structure to XmlSlurper.
Here's the generalised code:
def htmlText = """
<html>
<body>
<div id="divId" class="divclass">
<h2>Heading 2</h2>
<ol>
<li><h3><a class="box" href="#href1">href1 link text</a> <span>extra stuff</span></h3><address>Here is the address<span>Telephone number: <strong>telephone</strong></span></address></li>
<li><h3><a class="box" href="#href2">href2 link text</a> <span>extra stuff</span></h3><address>Here is another address<span>Another telephone: <strong>0845 1111111</strong></span></address></li>
</ol>
</div>
</body>
</html>
"""
def html = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText( htmlText );
html.'**'.grep { it.@class == 'divclass' }.ol.li.each { linkItem ->
def link = linkItem.h3.a.@href
def address = linkItem.address.text()
println "$link: $address\n"
}
I would expect each to let me select each 'li' in turn so that I can retrieve the corresponding href and address details. Instead, I am getting this output:
#href1#href2: Here is the addressTelephone number: telephoneHere is another addressAnother telephone: 0845 1111111
I've checked various examples on the web, and these either deal with XML or are one-liner examples like "retrieve all links from this file". It seems that the linkItem.h3.a.@href expression is collecting all hrefs in the text, even though I'm passing it a reference to the parent 'li' node.
Can you let me know:
Thanks.
Replace grep with find:
html.'**'.find { it.@class == 'divclass' }.ol.li.each { linkItem ->
def link = linkItem.h3.a.@href
def address = linkItem.address.text()
println "$link: $address\n"
}
then you'll get
#href1: Here is the addressTelephone number: telephone
#href2: Here is another addressAnother telephone: 0845 1111111
grep returns an ArrayList, but find returns a single NodeChild instance:
println html.'**'.grep { it.@class == 'divclass' }.getClass()
println html.'**'.find { it.@class == 'divclass' }.getClass()
results in:
class java.util.ArrayList
class groovy.util.slurpersupport.NodeChild
Thus, if you wanted to keep grep, you could nest another each like this for it to work:
html.'**'.grep { it.@class == 'divclass' }.ol.li.each {
    it.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}
Long story short, in your case, use find rather than grep.
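To make the grep/find difference concrete without any XML involved, here is a minimal sketch on a plain list. The list contents and variable names are made up for illustration; only grep and find themselves come from the answer above.

```groovy
// Hypothetical sample data, not taken from the question.
def words = ['alpha', 'beta', 'gamma']

// grep collects every matching element and always hands back a list,
// even when there is only one match.
def viaGrep = words.grep { it.startsWith('b') }

// find stops at the first match and returns that element itself.
def viaFind = words.find { it.startsWith('b') }

assert viaGrep instanceof List
assert viaGrep == ['beta']
assert viaFind instanceof String
assert viaFind == 'beta'
```

The extra list wrapper from grep is what changes the meaning of the property access that follows it.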
This is a tricky one. When there is just one element with class='divclass', the previous answer is fine. If grep returned multiple results, however, a find() for a single result is not the answer. Pointing out that the result is an ArrayList is correct. Inserting an outer .each() loop provides a GPathResult in the closure parameter div, and from there the drill-down continues with the expected result.
html."**".grep { it.@class == 'divclass' }.each { div ->
    div.ol.li.each { linkItem ->
        def link = linkItem.h3.a.@href
        def address = linkItem.address.text()
        println "$link: $address\n"
    }
}
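As a rough sketch of the multiple-match case, the same outer loop can be run over a document with two matching divs. The markup below is invented for the example and is already well-formed, so TagSoup is left out. The import assumes Groovy 3 or later; on Groovy 2.x, drop it, since XmlSlurper lives in groovy.util there and is imported by default.

```groovy
import groovy.xml.XmlSlurper // assumption: Groovy 3+; remove this line on Groovy 2.x

// Hypothetical, well-formed markup with two divs of the same class.
def xml = '''<html><body>
<div class="divclass"><ol>
<li><a href="#a1">one</a></li>
<li><a href="#a2">two</a></li>
</ol></div>
<div class="divclass"><ol>
<li><a href="#b1">three</a></li>
</ol></div>
</body></html>'''

def html = new XmlSlurper().parseText(xml)

def links = []
// The outer each visits every matching div; the inner each visits its li nodes.
html.'**'.grep { it.@class == 'divclass' }.each { div ->
    div.ol.li.each { li ->
        links << li.a.@href.text()
    }
}

assert links == ['#a1', '#a2', '#b1']
```

With find in place of grep, only the first div would be visited, so '#b1' would be missed.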
The behavior of the original code deserves a bit more explanation as well. When a property is accessed on a List in Groovy, you get a new list (of the same size) holding that property of each element. The list found by grep() has just one entry. Then we get one entry for property ol, which is fine. Next we get the result of ol.li for that entry. It is again a list of size() == 1, but this time its single entry has size() == 2. We could apply the outer loop there and get the same result, if we wanted to:
html."**".grep { it.@class == 'divclass' }.ol.li.each { it.each { linkItem ->
def link = linkItem.h3.a.@href
def address = linkItem.address
println "$link: $address\n"
}}
On any GPathResult representing multiple nodes, text() and string conversion return the concatenation of all the nodes' text. That explains the original output: first the concatenated @href values, then the concatenated address text.
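The list behavior described above can be seen on plain collections too. This is only an illustration with made-up maps standing in for parsed nodes, not the slurper's own classes:

```groovy
// Hypothetical stand-ins for two li nodes.
def items = [[href: '#href1'], [href: '#href2']]

// Property access on a list is applied to each element,
// giving a new list of the same size.
assert items.href == ['#href1', '#href2']

// Wrap the list once more, as grep did with its single match:
// the outer list still has one entry, and that entry holds both values.
def wrapped = [items]
def hrefs = wrapped.href
assert hrefs.size() == 1
assert hrefs[0].size() == 2
```

Iterating over the outer list therefore runs the closure once, with both items at the same time, which mirrors the single combined line in the question's output.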