
Scrapy CrawlSpider + Splash: how to follow links through linkextractor?

I have the following code, which is only partially working:

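from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy_splash import SplashRequest


class ThreadSpider(CrawlSpider):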
    name = 'thread'
    allowed_domains = ['bbs.example.com']
    start_urls = ['http://bbs.example.com/diy']

    rules = (
        Rule(LinkExtractor(
            allow=(),
            restrict_xpaths=("//a[contains(text(), 'Next Page')]")
        ),
            callback='parse_item',
            process_request='start_requests',
            follow=True),
    )

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse_item, args={'wait': 0.5})

    def parse_item(self, response):
        # item parser
        pass

The code runs only for start_urls and does not follow the links matched by restrict_xpaths. If I comment out the start_requests() method and the line process_request='start_requests' in the rules, it runs and follows links as intended, but of course without JS rendering.

I have read the two related questions, "CrawlSpider with Splash getting stuck after first URL" and "CrawlSpider with Splash", and specifically changed scrapy.Request() to SplashRequest() in the start_requests() method, but that does not seem to work. What is wrong with my code? Thanks.

eN_Joy asked Aug 25 '17 16:08


2 Answers

I've had a similar issue that seemed specific to integrating Splash with a Scrapy CrawlSpider. It would visit only the start URL and then close. The only way I managed to get it to work was to not use the scrapy-splash plugin and instead use the 'process_links' method to prepend the Splash HTTP API URL to all of the links Scrapy collects. Then I made other adjustments to compensate for the new issues that arise from this method. Here's what I did:

You'll need these two tools to put together the Splash URL and then take it apart again if you intend to store the original URL somewhere.

from urllib.parse import urlencode, parse_qs

With the Splash URL prepended to every link, Scrapy will filter them all out as offsite requests, so we make 'localhost' the allowed domain.

allowed_domains = ['localhost']
start_urls = ['https://www.example.com/']

However, this poses a problem because then we may end up endlessly crawling the web when we only want to crawl one site. Let's fix this with the LinkExtractor rules. By only scraping links from our desired domain, we get around the offsite request problem.

LinkExtractor(allow=r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com')),
process_links='process_links',
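
To see which links this allow pattern admits, it can be tried on a few URLs outside of Scrapy. The sketch below is only an illustration: the candidate URLs are made up, it assumes LinkExtractor applies allow patterns with search semantics (a match anywhere in the URL), and the `strict` pattern is a hypothetical tighter alternative rather than part of the original answer.

```python
import re

# Same pattern as in the rule above.
allow = re.compile(r'(http(s)?://)?(.*\.)?{}.*'.format(r'example.com'))

# Hypothetical stricter variant: anchored to the host part of the URL.
strict = re.compile(r'^https?://([^/]+\.)?example\.com(/|$)')

# Made-up URLs, for illustration only.
candidates = [
    "https://www.example.com/thread/1",
    "http://example.com/",
    "https://other-site.org/about",
    "https://other-site.org/?ref=example.com",
]

# Assumption: allow patterns are matched with re.search semantics.
allowed = {url: bool(allow.search(url)) for url in candidates}
allowed_strict = {url: bool(strict.search(url)) for url in candidates}

for url in candidates:
    print(url, allowed[url], allowed_strict[url])
```

Because every group before the domain is optional, the original pattern also accepts an off-site URL that merely mentions the domain in its query string. If that matters for the site being crawled, a host-anchored pattern along the lines of `strict` may be a safer starting point.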

Here's the process_links method. The dictionary in the urlencode method is where you'll put all of your splash arguments.

def process_links(self, links):
    for link in links:
        if "http://localhost:8050/render.html?&" not in link.url:
            link.url = "http://localhost:8050/render.html?&" + urlencode({'url':link.url,
                                                                          'wait':2.0})
    return links

Finally, to take the original URL back out of the Splash URL, use the parse_qs function.

parse_qs(response.url)['url'][0] 

One final note about this approach. You'll notice that I have an '&' in the Splash URL right at the beginning (...render.html?&). parse_qs splits its input on '&', so this puts the endpoint prefix into a chunk of its own, which has no '=' and is ignored. This makes parsing the Splash URL to take out the actual URL consistent no matter what order the arguments are in when you pass them to urlencode.
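
The round trip can be checked in isolation with the standard library alone. This is a minimal sketch, not part of the spider: the helper names `to_splash_url` and `original_url` are hypothetical, and the endpoint is assumed to be the local Splash instance used throughout this answer.

```python
from urllib.parse import urlencode, parse_qs

# Assumed endpoint: a local Splash instance, as in the answer above.
SPLASH_PREFIX = "http://localhost:8050/render.html?&"


def to_splash_url(url, wait=2.0):
    """Wrap a target URL in a Splash render.html URL (hypothetical helper)."""
    if SPLASH_PREFIX in url:
        return url  # already wrapped, leave it alone
    return SPLASH_PREFIX + urlencode({"url": url, "wait": wait})


def original_url(splash_url):
    """Recover the target URL from a wrapped URL (hypothetical helper)."""
    return parse_qs(splash_url)["url"][0]


wrapped = to_splash_url("https://www.example.com/page?id=1")
print(wrapped)
print(original_url(wrapped))
```

An alternative that does not rely on the leading '&' would be to parse only the query part, for example `parse_qs(urlsplit(splash_url).query)` with `urlsplit` from the same module.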

Hanan 'John' Goldstein answered Nov 09 '22 14:11


Seems to be related to https://github.com/scrapy-plugins/scrapy-splash/issues/92

Personally, I use dont_process_response=True so that the response is an HtmlResponse (which is required by the code in _requests_to_follow).

I also redefine the _build_request method in my spider, like so:

def _build_request(self, rule, link):
    r = SplashRequest(url=link.url, callback=self._response_downloaded, args={'wait': 0.5}, dont_process_response=True)
    r.meta.update(rule=rule, link_text=link.text)
    return r 

In the GitHub issue, some users just redefine the _requests_to_follow method in their class.

head7 answered Nov 09 '22 13:11