
Scrapy: Why are extracted strings in this format?

Tags: python, scrapy

I'm doing

item['desc'] = site.select('a/text()').extract()

but the result is printed like this:

[u'\n                    A mano libera\n                  ']

What must I do to trim the value and remove the strange characters like [u'\n , the trailing spaces and '] ?

I cannot trim it with strip():

exceptions.AttributeError: 'list' object has no attribute 'strip'

and if I convert it to a string and then strip it, the result is still the string above, which I suppose is in UTF-8.

realtebo asked Jun 08 '13 14:06


2 Answers

There's a nice solution to this using Item Loaders. Item Loaders are objects that get data from responses, process the data and build Items for you. Here's an example of an Item Loader that will strip the strings and return the first value that matches the XPath, if any:

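# Import paths from the Scrapy release current when this was written. Newer
# releases expose ItemLoader in scrapy.loader and the processors in
# itemloaders.processors.
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import MapCompose, TakeFirst

class MyItemLoader(XPathItemLoader):
    default_item_class = MyItem  # your own Item subclass with a 'desc' field
    # Strip surrounding whitespace from every extracted string
    default_input_processor = MapCompose(lambda string: string.strip())
    # Keep only the first non-empty value
    default_output_processor = TakeFirst()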

And you use it like this:

def parse(self, response):
    loader = MyItemLoader(response=response)
    loader.add_xpath('desc', 'a/text()')
    return loader.load_item()
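
To make the two processors less opaque, here is a rough plain-Python sketch of what they do to the extracted list. The functions map_each and take_first are hypothetical stand-ins written for illustration, not Scrapy's own code, and the sample list is the value from the question written as a Python 3 string:

```python
# Simplified stand-ins for the MapCompose and TakeFirst processors.
# Illustration only: the real processors handle more cases than this.

def map_each(func, values):
    """Apply func to every extracted value, as an input processor would."""
    return [func(value) for value in values]


def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != '':
            return value
    return None


# The list that extract() returned in the question.
extracted = ['\n                    A mano libera\n                  ']

stripped = map_each(str.strip, extracted)  # strip each string in the list
desc = take_first(stripped)                # then keep a single value
print(desc)
```

The point of the split is that stripping runs once per extracted string, while picking a single value happens only when the item is built.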
Capi Etheriel answered Oct 10 '22 19:10


The HTML page may very well contain these whitespace characters.

What you retrieve is a list of unicode strings, which is why you can't simply call strip() on it. If you want to strip the whitespace characters from each string in this list, you can run the following:

>>> [s.strip() for s in [u'\n                    A mano libera\n                  ']]
[u'A mano libera']

If only the first element matters to you, then simply do:

>>> [u'\n                    A mano libera\n                  '][0].strip()
u'A mano libera'
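
Indexing with [0] raises an IndexError when the XPath matches nothing, so a small helper can be safer. The sketch below is one possible way to do it, not part of Scrapy: first_stripped is a hypothetical name. It also shows why converting the list to a string first does not help. It is written for Python 3, so the strings carry no u prefix:

```python
def first_stripped(values, default=None):
    """Return the first value that is non-empty after stripping, else default."""
    for value in values:
        cleaned = value.strip()
        if cleaned:
            return cleaned
    return default


extracted = ['\n                    A mano libera\n                  ']

# str() on a list gives its printed representation, brackets and escape
# sequences included, so stripping that string cannot recover the inner text.
as_text = str(extracted).strip()
print(as_text.startswith('['))  # True

print(first_stripped(extracted))
print(first_stripped([], default=''))  # no match: falls back to the default
```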
icecrime answered Oct 10 '22 17:10