
Strip \n \t \r in scrapy

I'm trying to strip \r, \n, and \t characters in a Scrapy spider and then write the results to a JSON file.

I have a "description" field that is full of newlines, and the output isn't what I want: each description matched to its title.

I tried map(unicode.strip), but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map with unicode.strip really works.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I also tried:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What's the best way?

Lara M. asked Feb 09 '16 09:02




4 Answers

unicode.strip only deals with whitespace characters at the beginning and end of strings

Return a copy of the string with the leading and trailing characters removed.

not with \n, \r, or \t in the middle.

You can either use a custom method to remove those characters inside the string (using the regular expression module), or use XPath's normalize-space(), which

returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Example python shell session:

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>> 
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>> 
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>> 
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
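
The custom regular-expression method mentioned in this answer is not spelled out, so here is a minimal sketch of what it might look like. It assumes Python 3 (where str takes the place of unicode); the helper names remove_control_whitespace and collapse_whitespace are made up for this example and are not part of Scrapy.

```python
import re

# Hypothetical helpers, not part of Scrapy or any library.
_CONTROL_WS = re.compile(r'[\r\n\t]')


def remove_control_whitespace(text):
    """Delete every \\r, \\n and \\t, including those in the middle of the string."""
    return _CONTROL_WS.sub('', text)


def collapse_whitespace(text):
    """Roughly mimic XPath normalize-space(): collapse whitespace runs, trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()


# Sample value shaped like the first text node in the session above.
raw = '\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n'

cleaned = remove_control_whitespace(raw)   # ordinary spaces are left untouched
collapsed = collapse_whitespace(raw)       # single spaces only, no leading/trailing space
print(repr(cleaned))
print(repr(collapsed))
```

In a spider, either helper could be applied to each string returned by extract(), for example with a list comprehension. Note that Python's \s matches more characters than the four that XPath treats as whitespace, so the second helper is only an approximation of normalize-space().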
paul trmbrth answered Sep 18 '22 12:09


I'm new to Python and Scrapy and had a similar issue today. I solved it with the function w3lib.html.replace_escape_chars: I created a default input processor for my item loader, and it works without any issues. You can also bind it to a specific scrapy.Field(). The good thing is that it works with CSS selectors and CSV feed exports:

from scrapy.loader.processors import MapCompose  # itemloaders.processors in newer Scrapy versions
from w3lib.html import replace_escape_chars

yourloader.default_input_processor = MapCompose(replace_escape_chars)
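
To show what such an input processor does to each extracted value, here is a dependency-free sketch in Python 3. Both strip_escape_chars and map_each are hypothetical stand-ins written for this example; they only imitate the general behaviour of w3lib's replace_escape_chars and Scrapy's MapCompose and are not the library implementations.

```python
# Hypothetical stand-ins for illustration only; the real functions live in
# w3lib (replace_escape_chars) and Scrapy (MapCompose).
def strip_escape_chars(text, which_ones=('\n', '\t', '\r'), replace_by=''):
    """Replace each listed character with replace_by (empty string by default)."""
    for ch in which_ones:
        text = text.replace(ch, replace_by)
    return text


def map_each(*functions):
    """Build a processor that applies every function, in order, to every value."""
    def processor(values):
        for func in functions:
            values = [func(value) for value in values]
        return values
    return processor


# Assumed pipeline: drop \n, \t, \r first, then trim ordinary spaces.
clean_description = map_each(strip_escape_chars, str.strip)
result = clean_description(['\n\tFirst line\r\n', ' second\tline '])
print(result)
```

With an empty replacement, words separated only by a tab or newline get joined ("second\tline" becomes "secondline" here), so replacing with a single space may be the better choice for running text.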
Peter Húbek answered Sep 20 '22 12:09


As paul trmbrth suggests in his answer,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

is likely to be what you want. However, normalize-space also condenses whitespace contained within the string into a single space. If you only want to remove \r, \n, and \t without disturbing the other whitespace, you can use Python's translate() to remove those characters.

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This will still leave leading and trailing whitespace that is not in the set \r, \n, or \t. If you also want to get rid of that, just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
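
The difference between the two variants is easier to see on plain strings. The following is a small sketch assuming Python 3, where str.translate accepts the same {code point: None} mapping; the extracted list is a made-up stand-in for what sel.xpath(...).extract() might return.

```python
# Same table as above; in Python 3 the u'' prefix is optional.
trans_table = {ord(c): None for c in '\r\n\t'}

# Hypothetical stand-in for the list returned by sel.xpath(...).extract().
extracted = [
    '\n\n This is  some text,\n with tabs \t too;\n\n',
    "\n\nI think we're done here\n\n",
]

# Variant 1: delete \r, \n, \t only; ordinary spaces (even at the ends) remain.
translated_only = ' '.join(s.translate(trans_table) for s in extracted)

# Variant 2: strip() each piece first, then delete the remaining \r, \n, \t.
stripped_too = ' '.join(s.strip().translate(trans_table) for s in extracted)

print(repr(translated_only))
print(repr(stripped_too))
```

In both variants the double space inside "This is  some text" survives, which is the behaviour that distinguishes this approach from normalize-space().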
mhawke answered Sep 22 '22 12:09


The simplest example, extracting a price from alibris.com, is:

response.xpath('normalize-space(//td[@class="price"]//p)').get()
user1994 answered Sep 18 '22 12:09