I'm doing
item['desc'] = site.select('a/text()').extract()
but this will be printed like this
[u'\n A mano libera\n ']
What must I do to trim it and remove the strange characters, like the leading [u'\n , the trailing spaces and the closing '] ?
I cannot trim (strip) the list directly:
exceptions.AttributeError: 'list' object has no attribute 'strip'
If I convert it to a string and then strip it, the result is still the string shown above, which I suppose is UTF-8.
There's a nice solution to this using Item Loaders. Item Loaders are objects that get data from responses, process the data and build Items for you. Here's an example of an Item Loader that will strip the strings and return the first value that matches the XPath, if any:
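    # Newer Scrapy releases expose these as scrapy.loader.ItemLoader
    # and itemloaders.processors instead.
    from scrapy.contrib.loader import XPathItemLoader
    from scrapy.contrib.loader.processor import MapCompose, TakeFirst

    class MyItemLoader(XPathItemLoader):
        default_item_class = MyItem  # your own Item subclass
        # strip every extracted string
        default_input_processor = MapCompose(lambda string: string.strip())
        # keep only the first non-empty value
        default_output_processor = TakeFirst()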
And you use it like this:
    def parse(self, response):
        loader = MyItemLoader(response=response)
        loader.add_xpath('desc', 'a/text()')
        return loader.load_item()
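For intuition, here is a rough plain-Python sketch of what the two processors do to the extracted list. The helper names `strip_all` and `take_first` are hypothetical stand-ins written for this illustration, not Scrapy functions, and they only approximate the behaviour of `MapCompose` and `TakeFirst`:

```python
def strip_all(values):
    """Strip each extracted string, roughly like MapCompose with strip()."""
    return [value.strip() for value in values]


def take_first(values):
    """Return the first value that is neither None nor empty, else None.

    This approximates TakeFirst(); it is not the Scrapy implementation.
    """
    for value in values:
        if value is not None and value != '':
            return value
    return None


# The list from the question, as returned by extract()
extracted = [u'\n A mano libera\n ']
desc = take_first(strip_all(extracted))
print(desc)
```

The point is that the input step runs on every extracted string, while the output step collapses the list into a single value, so the item field ends up holding a clean string instead of a list.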
The HTML page may very well contain these whitespace characters.
What you retrieve is a list of unicode strings, which is why you can't simply call strip on it. If you want to strip the whitespace from each string in the list, you can run the following:
>>> [s.strip() for s in [u'\n A mano libera\n ']]
[u'A mano libera']
If only the first element matters to you, then simply do:
>>> [u'\n A mano libera\n '][0].strip()
u'A mano libera'
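One caveat with indexing `[0]` is that it raises an IndexError when the XPath matches nothing and the list is empty. A minimal sketch of a guard for that case follows; `first_stripped` is a hypothetical helper name, and returning a default value is just one possible choice:

```python
def first_stripped(values, default=None):
    """Return the first string with surrounding whitespace removed.

    Falls back to `default` when the list is empty (no XPath match).
    """
    return values[0].strip() if values else default


print(first_stripped([u'\n A mano libera\n ']))
print(first_stripped([], default=u''))
```

In the question's code this would be used as item['desc'] = first_stripped(site.select('a/text()').extract()).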