Normally, the item loader extracts data automatically before passing the values to the input processor:
- Data from xpath1 is extracted, and passed through the input processor of the name field. (Scrapy docs)
Is it possible to change this behaviour for certain fields of an item loader, so that I can pass a more complex structure (ideally the selector itself) to the input processor?
I have a HTML document like this:
<a class="foo" href="http://example.com">example 1</a>
<a class="foo" href="http://example.org">example 2</a>
Now I'd like to fetch these link elements in a spider:
loader.add_css('links', '.foo')
and parse them in the item loader to get a list of values (after the output processor) like this:
[("http://example.com", "example 1"), ("http://example.org", "example 2")]
However, since item loaders automatically convert the extracted input to unicode strings, this does not seem easy.
You can use .add_value() and "manually" construct a list of (href, text) tuples, in the order the question asks for:
# Build one (href, text) tuple per matching <a> element.
links = [(item.css('::attr(href)').extract()[0],
          item.css('::text').extract()[0])
         for item in response.css('.foo')]
loader.add_value('links', links)