Spider for reference:
import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from script.items import ScriptItem

class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            #print widget.extract()
            item = ScriptItem()
            item['url'] = widget.xpath('.//a/@href').extract()
            url = item['url']
            #print url
            yield item
When I run this, the output in the terminal is as follows:
2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script><br>
This is the html:
<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls">
<a class="stp-control stp-left stp-hidden"><</a>
<div class="stp-inner" style="width: auto">
<div class="stp-slide" style="left: -0%">
<a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0" style="margin: 0 0px 0 0px">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878713">
</a>
<a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1" style="margin: 0 0px 0 0px">
<span class="stp-help"></span>
<img src="//images.rewardstyle.com/img?v=2.13&p=n_24878708">
To me it seems to hit a block when trying to activate the JavaScript. I am aware that JavaScript cannot run in Scrapy, but there must be a way of getting to those links. I have looked at Selenium but cannot get a handle on it.
Any and all help welcome.
Some webpages show the desired data when you load them in a web browser. However, when you download them using Scrapy, you cannot reach the desired data using selectors. When this happens, the recommended approach is to find the data source and extract the data from it.
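To make the difference concrete, here is a minimal sketch using only the Python 3 standard library (not Scrapy itself). It runs roughly the same query as the spider, "hrefs of <a> tags inside div.shopthepost-widget", against two abridged samples: the HTML as downloaded and the HTML after the browser has run the script. The class name WidgetLinkParser and both sample strings are illustrative assumptions, trimmed from the markup quoted in the question.

```python
from html.parser import HTMLParser


class WidgetLinkParser(HTMLParser):
    """Collect hrefs of <a> tags nested inside shopthepost-widget divs."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # nesting depth of <div> tags inside a widget
        self.widget_ids = []  # data-widget-id of every widget seen
        self.links = []       # hrefs found inside widgets

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self.div_depth:
                self.div_depth += 1
            elif attrs.get("class") == "shopthepost-widget":
                self.div_depth = 1
                self.widget_ids.append(attrs.get("data-widget-id"))
        elif tag == "a" and self.div_depth and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1


def extract(html):
    """Return (widget_ids, links) for one HTML string."""
    parser = WidgetLinkParser()
    parser.feed(html)
    return parser.widget_ids, parser.links


# Abridged: what the server sends (the widget holds only a loader script).
RAW_HTML = (
    '<div class="shopthepost-widget" data-widget-id="708473">'
    '<script type="text/javascript">/* loads shopthepost.js */</script><br>'
    '</div>'
)

# Abridged: what the browser shows after the script has filled the widget.
RENDERED_HTML = (
    '<div class="shopthepost-widget" data-widget-id="708473">'
    '<div class="stp-outer"><a class="stp-control">&lt;</a>'
    '<div class="stp-inner"><div class="stp-slide">'
    '<a href="http://rstyle.me/iA-n/zzhv34c_" class="stp-product "></a>'
    '<a href="http://rstyle.me/iA-n/zzhvw4c_" class="stp-product "></a>'
    '</div></div></div></div>'
)

raw_ids, raw_links = extract(RAW_HTML)
rendered_ids, rendered_links = extract(RENDERED_HTML)
print(raw_ids, raw_links)
print(rendered_ids, rendered_links)
```

The downloaded HTML yields the widget id but no links, which matches the empty {'url': []} item in the question. The links only exist once something has executed the script.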
Scrapy is built on Twisted, a popular event-driven networking framework for Python, so it handles concurrency with non-blocking (asynchronous) code. It does not execute JavaScript, however, so scraping a page that builds its content dynamically requires a rendering tool such as a Selenium WebDriver or another browser driver.
Scrapy is fast because it sends requests asynchronously, which lets it fetch and extract data from many pages at once. BeautifulSoup, by contrast, is only a parsing library: it has no means to crawl or download pages by itself.
I've solved it with ScrapyJS.
Follow the setup instructions in the official documentation and this answer.
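For orientation, the setup mostly comes down to a few entries in the project's settings.py. The fragment below is a sketch, not the exact configuration used for this answer: it assumes the package under its later name, scrapy-splash (ScrapyJS was renamed), and a Splash instance listening on localhost port 8050. Check the module paths and priorities against the README of the version you install.

```python
# settings.py (fragment)

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"

# Route requests carrying a 'splash' meta key through Splash.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Avoid storing duplicate Splash arguments in the request queue.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Make duplicate filtering aware of Splash arguments.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```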
Here is the test spider I've used:
# -*- coding: utf-8 -*-
import scrapy

class TestSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            print widget.xpath('.//a/@href').extract()
And here is what I've got on the console:
[u'http://rstyle.me/iA-n/7bk8r4c_', u'http://rstyle.me/iA-n/7bk754c_', u'http://rstyle.me/iA-n/6th5d4c_', u'http://rstyle.me/iA-n/7bm3s4c_', u'http://rstyle.me/iA-n/2xeat4c_', u'http://rstyle.me/iA-n/7bi7f4c_', u'http://rstyle.me/iA-n/66abw4c_', u'http://rstyle.me/iA-n/7bm4j4c_']
[u'http://rstyle.me/iA-n/zzhv34c_', u'http://rstyle.me/iA-n/zzhvw4c_', u'http://rstyle.me/iA-n/zwuvk4c_', u'http://rstyle.me/iA-n/zzhvr4c_', u'http://rstyle.me/iA-n/zzh9g4c_', u'http://rstyle.me/iA-n/zzhz54c_', u'http://rstyle.me/iA-n/zwuuy4c_', u'http://rstyle.me/iA-n/zzhx94c_']
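If the spider returns empty lists, it can help to check Splash on its own first. The 'endpoint' and 'args' values in the spider's meta correspond to Splash's HTTP API, so the same render can be requested by hand. The sketch below only builds that URL (Python 3, standard library). The localhost:8050 address and the helper name splash_render_url are assumptions, and the middleware may send its arguments to Splash in a different form than this GET query.

```python
from urllib.parse import urlencode

# Assumption: Splash is running locally on its default port.
SPLASH_URL = "http://localhost:8050"


def splash_render_url(page_url, endpoint="render.html", **args):
    """Build a GET URL asking Splash to render page_url with the given args."""
    query = urlencode(dict(url=page_url, **args))
    return "{}/{}?{}".format(SPLASH_URL, endpoint, query)


# Same endpoint and wait time as in the spider's 'splash' meta.
request_url = splash_render_url("http://www.stopitrightnow.com/", wait=0.5)
print(request_url)
```

Opening the printed URL in a browser, or fetching it with curl, should return the page's HTML after its scripts have run. If the rstyle.me links are missing there too, increasing wait is the first thing to try.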