How To Remove White Space in Scrapy Spider Data

Question

I am writing my first spider in Scrapy and attempting to follow the documentation. I have implemented ItemLoaders. The spider extracts the data, but the data contains many line returns. I have tried many ways to remove them, but nothing seems to work. The replace_escape_chars utility is supposed to work, but I can't figure out how to use it with the ItemLoader. Also some people use (unicode.strip), but again, I can't seem to get it to work. Some people try to use these in items.py and others in the spider. How can I clean the data of these line returns ( )? My items.py file only contains the item names and field(). The spider code is below:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.loader import XPathItemLoader
from scrapy.utils.markup import replace_escape_chars
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = [
        "http://www.domain.com",
    ]

    def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')
        items.append(l.load_item())

        return items

Steven Almeroth · Accepted Answer

You can use the default_output_processor on the loader and also other processors on individual fields, see title:

from scrapy.spider import BaseSpider
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Compose, MapCompose
from w3lib.html import replace_escape_chars, remove_tags
from ccpstore.items import Greenhouse

class GreenhouseSpider(BaseSpider):
    name = "greenhouse"
    allowed_domains = ["domain.com"]
    start_urls = ["http://www.domain.com"]

    def parse(self, response):
        l = XPathItemLoader(Greenhouse(), response=response)
        l.default_output_processor = MapCompose(lambda v: v.strip(), replace_escape_chars)
        l.add_xpath('name', '//div[@class="product_name"]')
        l.add_xpath('title', '//h1', Compose(remove_tags))
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]')

        return l.load_item()

Dan Walker · Answer

It turns out that there were also many blank spaces in the data, so combining the answer of Steven with some more research allowed the data to have all tags, line returns and duplicate spaces removed. The working code is below. Note the addition of text() on the loader lines which removes the tags and the split and join processors to remove spaces and line returns.

def parse(self, response):
        items = []
        l = XPathItemLoader(item=Greenhouse(), response=response)
        l.default_input_processor = MapCompose(lambda v: v.split(), replace_escape_chars)
        l.default_output_processor = Join()
        l.add_xpath('title', '//h1/text()')
        l.add_xpath('usage', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl00_liItem"]/text()')
        l.add_xpath('repeat', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl02_liItem"]/text()')
        l.add_xpath('direction', '//li[@id="ctl18_ctl00_rptProductAttributes_ctl03_liItem"]/text()')
        items.append(l.load_item())
        return items

How To Remove White Space in Scrapy Spider Data

Tags:

web-scraping

scrapy

Dan Walker

2 Answers

Steven Almeroth

Dan Walker

Recent Activity

Donate For Us

How To Remove White Space in Scrapy Spider Data

Tags:

web-scraping

scrapy

Dan Walker

2 Answers

Steven Almeroth

Dan Walker

Related questions

Recent Activity

Donate For Us