Using normalize-space with Scrapy

Below is a mock-up of a document I'm working on:

<div>
<h4>Area</h4>
  <span class="aclass"> </span>
  <span class="bclass">
        <strong>Address:</strong>
  10 Downing Street

  London

  SW1
  </span>
</div>

I'm getting the address like this:

response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()").extract()

which returns

[u'\r\n  \t', u'\r\n  10 Downing Street\r\n\r\n  London     \r\n  \r\n  SW1\r\n  ']

I'm trying to clean that up with normalize-space. I've tried putting it in every location I could think of, but it either tells me there's a syntax error, or returns an empty string.

Updating to add that I'm trying to get this working without changing the selector too much. I have similar cases which don't have the <strong> tag, for example. The selector is overcomplicated in the example I've prepared here, but in the live version, I have to take that rather convoluted route to get to the address.

Regarding the possible duplicate: following the advice given there, I appended /normalize-space(.) to the selector, giving this:

response.xpath(u".//h4[. = 'Area']/following-sibling::span[contains(.,'Address:')]/text()/normalize-space(.)").extract()

That produces a ValueError: Invalid XPath: error.

user3185563 asked Feb 08 '23 11:02


2 Answers

You can locate the strong element, get the following text sibling and normalize it:

In [1]: response.xpath(u"normalize-space(.//strong[. = 'Address:']/following-sibling::text())").extract()
Out[1]: [u'10 Downing Street London SW1']
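A note on why wrapping the whole expression works while the question's attempts returned an empty string: in XPath 1.0, a function that expects a string and receives a node-set uses only the first node in document order. The question's selector matches two text nodes, and the first is whitespace only, so normalize-space over that node-set collapses to an empty string. The sketch below is a rough pure-Python model of that rule, not Scrapy code; the helper names are hypothetical and the input is the output list quoted in the question.

```python
import re

# Text nodes as returned by the question's original selector.
text_nodes = [
    '\r\n  \t',
    '\r\n  10 Downing Street\r\n\r\n  London     \r\n  \r\n  SW1\r\n  ',
]


def normalize_space(text):
    """Approximate XPath 1.0 normalize-space(): collapse runs of
    space, tab, CR and LF into one space, then trim both ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


def string_of_node_set(nodes):
    """Model of XPath 1.0 string conversion: only the first node counts."""
    return nodes[0] if nodes else ""


# normalize-space(<both text nodes>) sees only the whitespace-only node.
first_only = normalize_space(string_of_node_set(text_nodes))
print(repr(first_only))  # ''

# Selecting just the text after <strong> leaves a single node to normalize.
print(normalize_space(text_nodes[1]))  # 10 Downing Street London SW1
```

The trailing /normalize-space(.) form from the question fails for a different reason: a function call as a path step is XPath 2.0 syntax, and the lxml-based selectors Scrapy uses implement XPath 1.0.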

Alternatively, you can look into Item Loaders and their input and output processors. I often use Join(), TakeFirst() and MapCompose(unicode.strip) (str.strip on Python 3) to clean extra newlines and spaces out of the extracted data.
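If the selector has to stay as it is, the cleanup can also happen in Python after extraction, which is the idea behind the processors mentioned above. The following is a minimal sketch that does not use Scrapy's processor classes; clean_text_nodes is a hypothetical helper, and it relies on str.split(), which splits on any Unicode whitespace and is therefore slightly broader than XPath's normalize-space.

```python
def clean_text_nodes(nodes):
    """Join extracted text nodes and collapse every run of whitespace
    into a single space. Whitespace-only nodes vanish in the process."""
    return " ".join(" ".join(nodes).split())


# The list returned by .extract() in the question.
extracted = [
    '\r\n  \t',
    '\r\n  10 Downing Street\r\n\r\n  London     \r\n  \r\n  SW1\r\n  ',
]

address = clean_text_nodes(extracted)
print(address)  # 10 Downing Street London SW1
```

Because this works on whatever list the selector returns, it should also cover the variants without a <strong> tag, assuming the label text itself is not among the extracted nodes.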

alecxe answered Feb 12 '23 11:02


"normalize-space(//strong[contains(text(), 'Address:')]/following-sibling::node())"
eLRuLL answered Feb 12 '23 10:02