Strip \n \t \r in scrapy

Tags: python, unicode, scrapy

I'm trying to strip \r, \n and \t characters in a Scrapy spider and then write the results to a JSON file.

I have a "description" field that is full of newlines, and the output doesn't do what I want, which is to match each description to its title.

I tried map(unicode.strip()), but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map with unicode.strip really works.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I also tried:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What's the best way?

979

asked Feb 09 '16 09:02

Lara M.

4 Answers

unicode.strip only deals with whitespace characters at the beginning and end of strings. The Python documentation describes it as: "Return a copy of the string with the leading and trailing characters removed." It does not touch \n, \r, or \t in the middle of a string.

You can either use a custom method to remove those characters inside the string (using the regular expression module), or use XPath's normalize-space(), which, per the XPath specification, "returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space."
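A minimal sketch of the regular-expression route, assuming you want each run of \r, \n, or \t replaced by a single space and the ends trimmed. The helper name clean_control_whitespace is made up for illustration and is not part of Scrapy:

```python
import re

# Matches one or more consecutive carriage returns, newlines, or tabs.
CONTROL_WS = re.compile(r'[\r\n\t]+')


def clean_control_whitespace(text):
    # Hypothetical helper, not a Scrapy API: replace each run of
    # \r, \n, \t with one space, then trim leading/trailing whitespace.
    return CONTROL_WS.sub(' ', text).strip()


raw = '\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n'
cleaned = clean_control_whitespace(raw)
print(repr(cleaned))
```

Ordinary spaces that were already next to a removed character are kept, so the result can still contain runs of several spaces. If that matters, normalize-space() or a second pass over spaces may be the better fit.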

Example python shell session:

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>>
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>>
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>>
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
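The session above is Python 2. On Python 3 there is no unicode type and map() returns a lazy iterator, so a list comprehension over the extracted strings is the more usual spelling. The sketch below works on a plain list standing in for the result of .extract() (an assumption, so that it runs without Scrapy). It also shows ' '.join(s.split()), a pure-Python idiom that behaves much like normalize-space() on a single string, except that str.split() treats every Unicode whitespace character as a separator, not only space, tab, \r and \n:

```python
# Stand-in for div.xpath('.//p[@class="class-name"]/text()').extract()
extracted = [
    '\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
    "\n\nI think we're done here\n\n",
]

# strip() only trims the ends; inner \n, \r, \t survive.
stripped = [s.strip() for s in extracted]

# split() with no argument splits on any whitespace run and drops empty
# pieces, so joining with one space collapses all inner whitespace.
collapsed = [' '.join(s.split()) for s in extracted]

print(stripped)
print(collapsed)
```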

108

answered Sep 18 '22 12:09

paul trmbrth

I'm a Python and Scrapy newbie and had a similar issue today. I solved it with the w3lib.html.replace_escape_chars function. I created a default input processor for my item loader and it works without any issues. You can also bind it to a specific scrapy.Field(), and the good thing is that it works with CSS selectors and CSV feed exports:

# MapCompose lives in itemloaders.processors (scrapy.loader.processors in older Scrapy)
from itemloaders.processors import MapCompose
from w3lib.html import replace_escape_chars
yourloader.default_input_processor = MapCompose(replace_escape_chars)

answered Sep 20 '22 12:09

Peter Húbek

As paul trmbrth suggests in his answer,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

is likely to be what you want. However, normalize-space also condenses whitespace contained within the string into a single space. If you only want to remove \r, \n, and \t without disturbing the other whitespace, you can use Python's unicode.translate() (not the XPath function of the same name) to remove those characters.

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This will still leave leading and trailing whitespace that is not in the set \r, \n, or \t. If you also want to be rid of that just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
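On Python 3 the same idea works with str.translate and str.maketrans. This is a minimal sketch using made-up sample strings in place of the .extract() result. Note that deleting the characters outright can glue neighbouring words together, so mapping them to a space is one possible alternative if that matters for your data:

```python
# Delete \r, \n, \t: the third argument of str.maketrans lists characters to remove.
delete_table = str.maketrans('', '', '\r\n\t')

# Alternative: turn each of them into a single space instead of deleting it.
space_table = str.maketrans({'\r': ' ', '\n': ' ', '\t': ' '})

# Made-up sample standing in for sel.xpath(...).extract()
parts = ['  first line\r\nsecond line\t', '\tthird  line\n']

# Deleting keeps other whitespace untouched but joins "line" and "second".
deleted = ' '.join(s.translate(delete_table) for s in parts)

# strip() first, then replace the remaining control characters with spaces.
spaced = ' '.join(s.strip().translate(space_table) for s in parts)

print(repr(deleted))  # '  first linesecond line third  line'
print(repr(spaced))   # 'first line  second line third  line'
```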

answered Sep 22 '22 12:09

mhawke

The simplest example to extract price from alibris.com is

response.xpath('normalize-space(//td[@class="price"]//p)').get()

answered Sep 18 '22 12:09

user1994
