
How can Scrapy deal with JavaScript?

Tags: javascript, selenium, web-scraping, scrapy, scrapy-spider

Spider for reference:

import scrapy
from scrapy.spiders import Spider
from scrapy.selector import Selector
from script.items import ScriptItem


class RunSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

    def parse(self, response):
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            #print widget.extract()
            item = ScriptItem()
            item['url'] = widget.xpath('.//a/@href').extract()
            url = item['url']
            #print url
            yield item

When I run this the output in terminal is as follows:

2015-08-21 14:23:51 [scrapy] DEBUG: Scraped from <200 http://www.stopitrightnow.com/>
{'url': []}
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">!function(d,s,id){var e, p = /^http:/.test(d.location) ? 'http' : 'https';if(!d.getElementById(id)) {e = d.createElement(s);e.id = id;e.src = p + '://' + 'widgets.rewardstyle.com' + '/js/shopthepost.js';d.body.appendChild(e);}if(typeof window.__stp === 'object') if(d.readyState === 'complete') {window.__stp.init();}}(document, 'script', 'shopthepost-script');</script><br>

This is the HTML I see in the browser:

<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1"><div id="stp-55d44feabd0eb" class="stp-outer stp-no-controls">
    <a class="stp-control stp-left stp-hidden">&lt;</a>
    <div class="stp-inner" style="width: auto">
        <div class="stp-slide" style="left: -0%">
            <a href="http://rstyle.me/iA-n/zzhv34c_" target="_blank" rel="nofollow" class="stp-product " data-index="0" style="margin: 0 0px 0 0px">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878713">
            </a>
            <a href="http://rstyle.me/iA-n/zzhvw4c_" target="_blank" rel="nofollow" class="stp-product " data-index="1" style="margin: 0 0px 0 0px">
                <span class="stp-help"></span>
                <img src="//images.rewardstyle.com/img?v=2.13&amp;p=n_24878708">

It seems to me that the spider hits a wall at the point where the JavaScript would have to run. I am aware that Scrapy cannot execute JavaScript, but there must be a way of getting to those links. I have looked at Selenium but cannot get a handle on it.

Any and all help welcome.


asked Aug 21 '15 13:08

Wine.Merchant

1 Answer

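I've solved it with ScrapyJS (since renamed scrapy-splash), which sends requests through a Splash instance so that the page's JavaScript runs before the HTML reaches the spider.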
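
To see why the original spider yielded {'url': []}, it may help to compare the two snippets from the question. The sketch below uses only the standard library rather than Scrapy's selectors; the two HTML strings are trimmed (and closed) copies of the question's snippets, and `WidgetLinkCollector` and `widget_links` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser

# Trimmed copy of what the plain Scrapy download contained: only the loader script.
RAW_HTML = """
<div class="shopthepost-widget" data-widget-id="708473">
<script type="text/javascript">/* loader for shopthepost.js */</script><br>
</div>
"""

# Trimmed copy of what the browser shows after the script has run.
RENDERED_HTML = """
<div class="shopthepost-widget" data-widget-id="708473" data-widget-uid="1">
  <div class="stp-inner">
    <a href="http://rstyle.me/iA-n/zzhv34c_" class="stp-product"></a>
    <a href="http://rstyle.me/iA-n/zzhvw4c_" class="stp-product"></a>
  </div>
</div>
"""


class WidgetLinkCollector(HTMLParser):
    """Collect href values of <a> tags nested inside the widget div."""

    def __init__(self):
        super().__init__()
        self.depth = 0   # div nesting depth inside the widget (0 means outside)
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the widget div, then track nested divs.
            if self.depth or attrs.get("class") == "shopthepost-widget":
                self.depth += 1
        elif tag == "a" and self.depth and "href" in attrs:
            self.links.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1


def widget_links(html):
    """Return the hrefs found inside the shopthepost widget of `html`."""
    collector = WidgetLinkCollector()
    collector.feed(html)
    return collector.links


print(widget_links(RAW_HTML))
print(widget_links(RENDERED_HTML))
```

The raw download has no anchors inside the widget, so any selector comes back empty; the links only exist once the script has built them, which is the step a rendering service takes over.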

Follow the setup instructions in the official documentation and this answer.
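
For orientation, the project settings usually end up looking something like the sketch below. It follows the scrapy-splash README rather than the older ScrapyJS package names, and it assumes a Splash instance is already listening on localhost:8050; check both against the documentation for the version you have installed.

```python
# settings.py (sketch, not a complete settings file)

# Assumption: Splash is running locally, for example started with
#   docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = "http://localhost:8050"

# Middlewares that rewrite requests carrying a 'splash' meta key
# so that they are sent to the Splash instance above.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}

# Duplicate filter that takes the Splash arguments into account.
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```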

Here is the test spider I've used:

# -*- coding: utf-8 -*-
import scrapy


class TestSpider(scrapy.Spider):
    name = "run"
    allowed_domains = ["stopitrightnow.com"]
    start_urls = (
        'http://www.stopitrightnow.com/',
    )

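    def start_requests(self):
        # Route each start URL through Splash so the JavaScript is executed
        # before the response reaches parse().
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}  # seconds to wait after the page loads
                }
            })

    def parse(self, response):
        # The response now holds the rendered HTML, so the widget links exist.
        for widget in response.xpath('//div[@class="shopthepost-widget"]'):
            print(widget.xpath('.//a/@href').extract())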
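
The 'splash' meta key above asks the middleware to fetch the page from Splash's render.html endpoint instead of from the site directly. As a rough illustration only (the middleware builds its own request and does not necessarily use this exact form), the same endpoint can be addressed over Splash's HTTP API with a URL like the one built below; `splash_render_url` is a hypothetical helper and the localhost address is an assumption.

```python
from urllib.parse import urlencode


def splash_render_url(page_url, wait=0.5, splash_url="http://localhost:8050"):
    """Build a render.html URL asking Splash to load `page_url`,
    wait `wait` seconds, and return the resulting HTML."""
    query = urlencode({"url": page_url, "wait": wait})
    return "{}/render.html?{}".format(splash_url, query)


print(splash_render_url("http://www.stopitrightnow.com/"))
```

Opening that URL in a browser while Splash is running is a quick way to confirm that the rendered page really contains the widget links before involving the spider.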

And here is what I've got on the console:

[u'http://rstyle.me/iA-n/7bk8r4c_', u'http://rstyle.me/iA-n/7bk754c_', u'http://rstyle.me/iA-n/6th5d4c_', u'http://rstyle.me/iA-n/7bm3s4c_', u'http://rstyle.me/iA-n/2xeat4c_', u'http://rstyle.me/iA-n/7bi7f4c_', u'http://rstyle.me/iA-n/66abw4c_', u'http://rstyle.me/iA-n/7bm4j4c_']
[u'http://rstyle.me/iA-n/zzhv34c_', u'http://rstyle.me/iA-n/zzhvw4c_', u'http://rstyle.me/iA-n/zwuvk4c_', u'http://rstyle.me/iA-n/zzhvr4c_', u'http://rstyle.me/iA-n/zzh9g4c_', u'http://rstyle.me/iA-n/zzhz54c_', u'http://rstyle.me/iA-n/zwuuy4c_', u'http://rstyle.me/iA-n/zzhx94c_']


answered Sep 27 '22 21:09

alecxe
