I am currently trying to scrape a website for article prices, but I have run into a problem (after having somehow solved the earlier problem that the prices were dynamically generated, which was a huge pain).
I can receive the prices and the article names without a problem, but every second result for 'price' is "\xa0". I have tried removing it using 'normalize-space()', but to no avail.
My code:
import scrapy
from scrapy import signals
from scrapy.http import TextResponse
from scrapy.xlib.pydispatch import dispatcher
from horni.items import HorniItem
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
from selenium.webdriver.common.keys import Keys

class mySpider(scrapy.Spider):
    name = "placeholder"
    allowed_domains = ["placeholder.com"]
    start_urls = ["https://www.placeholder.com"]

    def __init__(self):
        self.driver = webdriver.Chrome()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        self.driver.close()

    def parse(self, response):
        self.driver.get("https://www.placeholder.com")
        # Re-wrap the Selenium-rendered page so Scrapy selectors work on it
        response = TextResponse(url=self.driver.current_url, body=self.driver.page_source, encoding='utf-8')
        for post in response.xpath('//body'):
            item = HorniItem()
            item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
            item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract()
            yield item
\xa0 is a non-breaking space (0xA0 in Latin-1, U+00A0 in Unicode). XPath's normalize-space() only collapses the ASCII whitespace characters (space, tab, carriage return, line feed), which is why it left the \xa0 untouched. Replace it like this:
string = string.replace(u'\xa0', u' ')
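For example, a quick check with a made-up price string (the value is hypothetical; the mechanism is plain str.replace):

    price = u'29,99\xa0\u20ac'           # hypothetical scraped value containing the NBSP
    price = price.replace(u'\xa0', u' ') # now u'29,99 \u20ac'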
Update:
You can apply it in your parse loop as follows:
for post in response.xpath('//body'):
    item = HorniItem()
    item['article_name'] = post.xpath('//a[@class="title-link"]/span/text()').extract()
    # extract_first() returns a single string (or the default), so .replace() works on it
    item['price'] = post.xpath('//p[@class="display-price"]/span/text()').extract_first(default='')
    item['price'] = item['price'].replace(u'\xa0', u' ')
    if item['price'].strip():
        yield item
Here you replace the character and then only yield the item if the price is not empty.
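If you want to stay with .extract(), keep in mind that it returns a list of strings, so the replacement has to be applied per element. A minimal sketch, reusing the question's selector and field names:

    # Clean every price in the extracted list instead of a single value
    raw_prices = post.xpath('//p[@class="display-price"]/span/text()').extract()
    cleaned = [p.replace(u'\xa0', u' ').strip() for p in raw_prices]
    item['price'] = [p for p in cleaned if p]  # drop entries that were only the NBSP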