
Remove/Exclude Non-Breaking Space from Scrapy result

Tags:

python

scrapy

I am currently trying to scrape a website for article prices but I have run into a problem (after having somehow solved the problem that the prices were dynamically generated, which was a huge pain).

I am able to receive the prices and the article names without a problem, but every second result for 'price' is "\xa0". I have tried removing it using 'normalize-space()' but to no avail.

My code:

import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys

class mySpider(scrapy.Spider):
    name = "placeholder"
    allowed_domains = ["placeholder.com"]
    start_urls = ["https://www.placeholder.com"]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get("https://www.placeholder.com")
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//body'):
            item = HorniItem()
            item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
            item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract()
            yield item
rongon asked Jun 24 '16 09:06


1 Answer

\xa0 is the non-breaking space (U+00A0 in Latin-1 and Unicode). Replace it with a regular space like this:

string = string.replace(u'\xa0', u' ')
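As a quick check of what that replacement does, here is a minimal sketch on a single string. The value of `raw` is a made-up example, not taken from the site in the question:

```python
import unicodedata

# Hypothetical scraped value with a leading non-breaking space
raw = u'\xa019,99'

# Look up the Unicode name of the character to confirm what it is
nbsp_name = unicodedata.name(u'\xa0')

# Replace the non-breaking space with a regular space, then trim
cleaned = raw.replace(u'\xa0', u' ').strip()

# A value that consists only of the non-breaking space becomes empty
only_nbsp = u'\xa0'.replace(u'\xa0', u' ').strip()
```

After this, `cleaned` holds only the price text, and `only_nbsp` is an empty string, which is what makes the "skip empty prices" check possible.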

Update:

You can apply it as follows:

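for post in response.xpath('//body'):
    item = HorniItem()
    item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
    # extract() returns a list of strings, so clean every entry rather than the list itself
    prices = post.xpath('//p[@class="display-price"]/span/text()').extract()
    prices = [p.replace(u'\xa0', u' ').strip() for p in prices]
    # keep only the entries that still contain text after cleaning
    item['price'] = [p for p in prices if p]
    if item['price']:
        yield item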

This replaces the character and then yields the item only if the price is not empty.
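Since `extract()` returns a list, the same idea can be wrapped in a small helper that works on the whole list. This is only a sketch: `clean_prices` and the sample values are hypothetical names and data, chosen to mimic the "every second result is \xa0" pattern from the question:

```python
def clean_prices(raw_prices):
    """Replace non-breaking spaces, trim whitespace, and drop empty entries."""
    cleaned = (p.replace(u'\xa0', u' ').strip() for p in raw_prices)
    return [p for p in cleaned if p]


# Hypothetical list as extract() might return it: every second entry is just \xa0
sample = [u'19,99\xa0\u20ac', u'\xa0', u'5,49\xa0\u20ac', u'\xa0']
prices = clean_prices(sample)
```

In the spider, the call would presumably look like `item['price'] = clean_prices(post.xpath(...).extract())`, which keeps the parsing loop short.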

cb0 answered Nov 19 '22 08:11