Remove/Exclude Non-Breaking Space from Scrapy result

Question

I am trying to scrape a website for article prices, but I have run into a problem (after finally getting past the fact that the prices are dynamically generated, which was a huge pain).

I am able to receive the prices and the article names without a problem, but every second result for 'price' is "\xa0". I have tried removing it using 'normalize-space()' but to no avail.

My code:

import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys

class mySpider(scrapy.Spider):
    name = "placeholder"
    allowed_domains = ["placeholder.com"]
    start_urls = ["https://www.placeholder.com"]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get("https://www.placeholder.com")
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//body'):
            item = myItem()
            item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
            item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract()
            yield item

cb0 · Accepted Answer

\xa0 is the non-breaking space in Latin1 (U+00A0 in Unicode). Replace it in a string like this:

string = string.replace(u'\xa0', u' ')
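As a minimal sketch of what that replacement does, assuming a made-up price string (the values below are hypothetical, not taken from the scraped site):

```python
# Hypothetical value, similar to what a price node might contain.
price = u'19,99\xa0€'

# The non-breaking space becomes an ordinary space.
cleaned = price.replace(u'\xa0', u' ')

# A result that consists only of a non-breaking space turns into a
# plain space, which strip() then reduces to an empty string.
blank = u'\xa0'.replace(u'\xa0', u' ').strip()

print(repr(cleaned))
print(repr(blank))
```

Note that replace() is a string method, so it has to be called on each extracted string rather than on the list that extract() returns.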

Update:

You can apply it to your code as follows:

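for post in response.xpath('//body'):
    item = myItem()
    item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
    prices = post.xpath('//p[@class="display-price"]/span/text()').extract()
    # extract() returns a list of strings, so replace the character in each entry
    prices = [p.replace(u'\xa0', u' ').strip() for p in prices]
    # keep only the entries that still contain text
    item['price'] = [p for p in prices if p]
    if item['price']:
        yield item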

Here the non-breaking space is replaced, and the item is yielded only if the price is not empty.
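If the cleanup is needed in more than one place, it could be wrapped in a small helper. The following is only a sketch: the name clean_prices and the sample values are assumptions for illustration, and the input is assumed to be the list of strings that extract() returns. It also suggests why normalize-space() may not have helped: XPath 1.0 treats only space, tab, carriage return and line feed as whitespace, so U+00A0 is left in place.

```python
NBSP = u'\xa0'  # non-breaking space (U+00A0)


def clean_prices(values):
    """Replace non-breaking spaces, trim each entry, and drop empty ones.

    `values` is assumed to be a list of strings, such as the result
    of calling extract() on a selector.
    """
    cleaned = (v.replace(NBSP, u' ').strip() for v in values)
    return [v for v in cleaned if v]


# Hypothetical extraction result where every second entry is only a
# non-breaking space, as described in the question.
raw = [u'19,99\xa0€', u'\xa0', u'5,49\xa0€', u'\xa0']
print(clean_prices(raw))
```

In the spider, this would be used as item['price'] = clean_prices(prices), which keeps the parse method short.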

Tags:

python

scrapy

rongon
