Below is a mock-up of a document I'm working on:
<div>
<h4>Area</h4>
<span class="aclass"> </span>
<span class="bclass">
<strong>Address:</strong>
10 Downing Street
London
SW1
</span>
</div>
I'm getting the address like this:
response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()").extract()
which returns
[u'\r\n \t', u'\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n ']
I'm trying to clean that up with normalize-space. I've tried putting it in every location I could think of, but it either tells me there's a syntax error, or returns an empty string.
Updating to add that I'm trying to get this working without changing the selector too much. I have similar cases which don't have the <strong> tag, for example. The selector is overcomplicated in the example I've prepared here, but in the live version, I have to take that rather convoluted route to get to the address.
Regarding the possible duplicate: following the advice given there, I added /normalize-space(.), giving this:
response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()/normalize-space(.)").extract()
That produces a "ValueError: Invalid XPath" error.
You can locate the strong element, get its following text sibling, and normalize it:
In [1]: response.xpath(u"normalize-space(.//strong[. = 'Address:']/following-sibling::text())").extract()
Out[1]: [u'10 Downing Street London SW1']
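As for why wrapping the original selector in normalize-space() tends to come back empty: in XPath 1.0, a function that expects a string and is handed a node-set only uses the first node in document order, and here the first text() node is the whitespace before the strong tag. Below is a rough plain-Python model of that behaviour, not real XPath evaluation. The helper names normalize_space and normalize_space_of_nodeset are made up for illustration, and the input is the list from the question.

```python
import re

# XPath 1.0 treats only space, tab, CR and LF as whitespace.
_XPATH_WS = re.compile(r"[ \t\r\n]+")


def normalize_space(text):
    """Mimic XPath normalize-space() on a single string."""
    return _XPATH_WS.sub(" ", text).strip(" ")


def normalize_space_of_nodeset(text_nodes):
    """Model normalize-space(node-set): only the first node is used."""
    if not text_nodes:
        return ""
    return normalize_space(text_nodes[0])


# The two text() nodes from the question, in document order.
text_nodes = [
    "\r\n \t",
    "\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n ",
]

# Wrapping the whole node-set picks the whitespace-only node first.
whole = normalize_space_of_nodeset(text_nodes)

# Dropping whitespace-only nodes first leaves the address as the first node.
filtered = normalize_space_of_nodeset(
    [t for t in text_nodes if normalize_space(t)]
)

print(repr(whole))
print(repr(filtered))
```

If that model matches your case, the XPath counterpart of the filtering step would be a predicate such as text()[normalize-space()] on the original selector, with normalize-space() wrapped around the whole expression. That should keep the selector close to what you have and would not depend on the strong tag being present, but treat it as a suggestion to try rather than something verified against your live pages.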
Alternatively, you can look into Item Loaders and input and output processors. I often use Join(), TakeFirst() and MapCompose(unicode.strip) to clean extra newlines and spaces out of the extracted data.
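If you would rather leave the selector exactly as it is, the same kind of clean-up can be done in plain Python after extract(). This is only a sketch of the idea, not the Item Loader API: the function name clean is hypothetical, and the input is the list shown in the question (written as Python 3 strings).

```python
def clean(fragments):
    """Collapse whitespace runs in each fragment and drop empty results."""
    # str.split() with no arguments splits on any run of whitespace and
    # ignores leading/trailing whitespace, so joining with a single space
    # gives a result similar to XPath normalize-space().
    collapsed = (" ".join(fragment.split()) for fragment in fragments)
    return [text for text in collapsed if text]


# Output of the original selector, as posted in the question.
raw = [
    "\r\n \t",
    "\r\n 10 Downing Street\r\n\r\n London \r\n \r\n SW1\r\n ",
]

cleaned = clean(raw)
print(cleaned)
```

One difference to be aware of: str.split() splits on any Unicode whitespace, whereas normalize-space() only considers space, tab, carriage return and line feed, so the two can disagree on text containing characters such as non-breaking spaces.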
"normalize-space(//strong[contains(text(), 'Address:')]/following-sibling::node())"