I'm trying to strip \r, \n and \t characters with a Scrapy spider, then write the result to a JSON file.
I have a "description" field that is full of newlines, and it doesn't do what I want: match each description to a title.
I tried map(unicode.strip), but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map with unicode.strip really works.
This is my code:
def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())
I also tried:
item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()
But it raised an error. What's the best way?
Scrapy comes with its own mechanism for extracting data. They're called selectors because they “select” certain parts of the HTML document specified either by XPath or CSS expressions. XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
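For instance, a tiny interactive check (the markup below is invented purely for illustration) shows that the same node can be selected with either an XPath or a CSS expression:

>>> import scrapy
>>> sel = scrapy.Selector(text='<div class="d-grid-main"><p class="class-name">hello</p></div>')
>>> sel.xpath('//p[@class="class-name"]/text()').extract()
[u'hello']
>>> sel.css('p.class-name::text').extract()
[u'hello']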
unicode.strip only deals with whitespace characters at the beginning and end of strings:

    Return a copy of the string with the leading and trailing characters removed.

It does not touch \n, \r, or \t in the middle of the string.
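A quick interactive check (the sample string is made up for illustration):

>>> u'\t  hello\nworld  \r\n'.strip()
u'hello\nworld'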
You can either use a custom method to remove those characters inside the string (using the regular expression module), or even use XPath's normalize-space(), which

    returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.
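If you go the regular-expression route, a minimal sketch could look like this (the helper name remove_control_chars is my own, not part of the original answer; the last line just mirrors the loop from the question):

import re

def remove_control_chars(text):
    # remove \r, \n and \t anywhere in the string, then trim leading/trailing whitespace
    return re.sub(u'[\r\n\t]', u'', text).strip()

item['DESCRIPTION'] = [remove_control_chars(t) for t in sel.xpath('//p[@class="class-name"]/text()').extract()]

Note that, unlike normalize-space(), this deletes the characters outright, so two words separated only by a newline will end up joined together.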
Example python shell session:
>>> import scrapy
>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
...
... This is some text,
... with some newlines \r
... and some \t tabs \t too;
...
... <a href="http://example.com"> and a link too
... </a>
...
... I think we're done here
...
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>>
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n', u"\n\nI think we're done here\n\n"]
>>>
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>>
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>>
I'm a Python and Scrapy newbie. I had a similar issue today and solved it with the help of the module/function w3lib.html.replace_escape_chars. I created a default input processor for my item loader and it works without any issues. You can also bind this to a specific scrapy.Field(), and the good thing is that it works with CSS selectors and CSV feed exports:
from scrapy.loader.processors import MapCompose
from w3lib.html import replace_escape_chars

yourloader.default_input_processor = MapCompose(replace_escape_chars)
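If you prefer binding the processor to a specific field rather than setting a loader-wide default, a sketch under my own assumptions (the XItem class is hypothetical; the field names are copied from the question) could look like:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from w3lib.html import replace_escape_chars

class XItem(scrapy.Item):
    TITLE = scrapy.Field()
    # \r, \n and \t are removed from this field's values as they are loaded
    DESCRIPTION = scrapy.Field(input_processor=MapCompose(replace_escape_chars))

Inside a spider callback you would then populate it with an ItemLoader, e.g. loader = ItemLoader(item=XItem(), response=response), loader.add_xpath('DESCRIPTION', '//p[@class="class-name"]/text()'), item = loader.load_item().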
As paul trmbrth suggests in his answer,
div.xpath('normalize-space(.//p[@class="class-name"])').extract()
is likely to be what you want. However, normalize-space also condenses whitespace contained within the string into a single space. If you want only to remove \r, \n, and \t without disturbing the other whitespace, you can use translate() to remove characters.
trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
This will still leave leading and trailing whitespace that is not in the set \r, \n, or \t. If you also want to be rid of that, just insert a call to strip():
item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
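For reference, a quick check of what translate() with that table does (the sample string is invented); note that the characters are deleted outright rather than replaced by spaces:

>>> trans_table = {ord(c): None for c in u'\r\n\t'}
>>> u'  keep spaces,\r\n\tdrop the rest  '.translate(trans_table).strip()
u'keep spaces,drop the rest'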
The simplest example, extracting the price from alibris.com, is:
response.xpath('normalize-space(//td[@class="price"]//p)').get()