How do Scrapy rules work with crawl spider

I have a hard time understanding Scrapy's CrawlSpider rules. My example doesn't work the way I would like, so it can be one of two things:

I don't understand how rules work.
I wrote an incorrect regex that prevents me from getting the results I need.

OK, here is what I want to do:

I want to write a crawl spider that gets all the available statistics from the http://www.euroleague.net website. The page that hosts everything I need to start from is http://www.euroleague.net/main/results/by-date.

Step 1

As a first step, I want to extract the "Seasons" link(s) and follow them. Here is the HTML/href that I intend to match (I want to match all the links in the "Seasons" section one by one, but I think it is easier to show one link as an example):

href="/main/results/by-date?seasoncode=E2001"

And here is a rule/regex that I created for it:

Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True),

Step 2

When the spider reaches the page http://www.euroleague.net/main/results/by-date?seasoncode=E2001, the second step is to extract the link(s) from the "Regular season" section. In this case, let's say it should be "Round 1". The HTML/href that I am looking for is:

<a href="/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS"

And the rule/regex that I constructed would be:

Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True),

Step 3

Now that I have reached the page (http://www.euroleague.net/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS), I am ready to extract the links that lead to the pages with all the information I need. I am looking for this HTML/href:

href="/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore"

And the rule/regex that has to follow it would be:

Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'),

The problem

I think the crawler should work something like this: the rules act like a loop. When the first link is matched, the crawler follows it to the "Step 2" page, then to "Step 3", and after that it extracts the data. Then it returns to "Step 1" to match the second link and starts the loop again, until there are no links left in the first step.

From what I see in the terminal, the crawler seems to loop in "Step 1". It goes through all the "Step 1" links but doesn't involve the "Step 2"/"Step 3" rules.

2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date)

After it loops through all the "Seasons" links, it continues with links that I don't see in any of the three steps I mentioned:

http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013

Such a link structure can only be found if you loop through all the links in "Step 2" without returning to the "Step 1" starting point.

So the question is: how do rules work? Do they work step by step, as I intend in this example, or does every rule have its own loop, moving on to the next rule only after it has finished looping through the first one?

That is how I see it. Of course, something could be wrong with my rules/regex, which is very possible.

And here is everything I get from the terminal:

scrapy crawl basketsp_test -o item6.xml -t xml
2014-02-28 01:09:20+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: basketbase)
2014-02-28 01:09:20+0200 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-02-28 01:09:20+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'basketbase.spiders', 'FEED_FORMAT': 'xml', 'SPIDER_MODULES': ['basketbase.spiders'], 'FEED_URI': 'item6.xml', 'BOT_NAME': 'basketbase'}
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled item pipelines: Basketpipeline3, Basketpipeline1db
2014-02-28 01:09:21+0200 [basketsp_test] INFO: Spider opened
2014-02-28 01:09:21+0200 [basketsp_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-28 01:09:21+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None)
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Filtered duplicate request: <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:25+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2005> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2006> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2007> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2008> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2009> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:28+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2010> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2011> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2012> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=24&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=22&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=21&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=20&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=19&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=18&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=17&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=16&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=15&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:36+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=14&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=13&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=12&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:38+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=11&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=10&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=9&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=8&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=7&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:41+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=6&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=5&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=4&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:43+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=3&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=2&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=1&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Closing spider (finished)
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 13663,
     'downloader/request_count': 39,
     'downloader/request_method_count/GET': 39,
     'downloader/response_bytes': 527838,
     'downloader/response_count': 39,
     'downloader/response_status_count/200': 39,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 2, 27, 23, 9, 44, 569579),
     'log_count/DEBUG': 46,
     'log_count/INFO': 3,
     'request_depth_max': 2,
     'response_received_count': 39,
     'scheduler/dequeued': 39,
     'scheduler/dequeued/memory': 39,
     'scheduler/enqueued': 39,
     'scheduler/enqueued/memory': 39,
     'start_time': datetime.datetime(2014, 2, 27, 23, 9, 21, 111255)}
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Spider closed (finished)

And here is the rules part of the crawler:

class Basketspider(CrawlSpider):
    name = "basketsp_test"
    download_delay = 0.5

    allowed_domains = ["www.euroleague.net"]
    start_urls = ["http://www.euroleague.net/main/results/by-date"]
    rules = (
        Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)), follow=True),
        Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)), follow=True),
        Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)), callback='parse_item'),
    )

asked Feb 27 '14 23:02

Vy.Iv

3 Answers

If you read Chinese, I have a blog post in Chinese about this:

别再滥用scrapy CrawlSpider中的follow=True ("Stop abusing follow=True in Scrapy's CrawlSpider")

Let's check out how the rules work under the hood:

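def _requests_to_follow(self, response):
    seen = set()
    # every rule is tried against the same response, in declaration order
    for n, rule in enumerate(self._rules):
        # links already claimed by an earlier rule are skipped
        links = [lnk for lnk in rule.link_extractor.extract_links(response)
                 if lnk not in seen]
        for link in links:
            seen.add(link)
            r = Request(url=link.url, callback=self._response_downloaded)
            # (excerpt: the full source also records the rule index n in
            # r.meta['rule'], which _response_downloaded reads)
            yield r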

As you can see, when a response is followed, every rule's link extractor is run over it in a for loop, and each extracted link is added to the seen set, so a link already picked up by an earlier rule is skipped by the later ones.
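To make the effect of the seen set concrete, here is a minimal sketch that imitates that loop with plain re instead of Scrapy. The assign_links helper and the rule names ("seasons", "rounds", "games") are made up for illustration; the patterns and hrefs are the ones quoted in the question. It assumes an allow pattern is applied as a regex search (a match anywhere in the URL), not a full match.

```python
import re

# Patterns from the question, in the order the rules are declared.
RULES = [
    ("seasons", re.compile(r'by-date\?seasoncode\=E\d+')),
    ("rounds", re.compile(r'seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+')),
    ("games", re.compile(r'gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+')),
]

# hrefs quoted in the question (steps 1, 2 and 3).
HREFS = [
    "/main/results/by-date?seasoncode=E2001",
    "/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS",
    "/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore",
]


def assign_links(hrefs, rules):
    """Hypothetical stand-in for the rule loop: the first matching rule wins."""
    seen = set()
    assigned = {}
    for name, pattern in rules:
        links = [h for h in hrefs if pattern.search(h) and h not in seen]
        for link in links:
            seen.add(link)
            assigned[link] = name
    return assigned


assigned = assign_links(HREFS, RULES)
for href in HREFS:
    print(assigned.get(href), href)
```

The step 2 href ends up under the first rule, because the first pattern is also found inside it. With the rules from the question that is harmless, since both rules only follow, but it would matter as soon as the two rules had different callbacks.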

All of these responses are then handled by self._response_downloaded:

def _response_downloaded(self, response):
    rule = self._rules[response.meta['rule']]
    return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

def _parse_response(self, response, callback, cb_kwargs, follow=True):

    if callback:
        cb_res = callback(response, **cb_kwargs) or ()
        cb_res = self.process_results(response, cb_res)
        for requests_or_item in iterate_spider_output(cb_res):
            yield requests_or_item

    # follow will go back to the rules again
    if follow and self._follow_links:
        for request_or_item in self._requests_to_follow(response):
            yield request_or_item

And so it goes back to self._requests_to_follow(response) again and again, for every response whose rule has follow=True.
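The whole cycle can be imitated without Scrapy. The following is only a toy model under stated assumptions: the SITE dictionary, its URLs and the crawl function are invented for illustration, and a simple FIFO queue replaces Scrapy's scheduler and concurrent downloads, so the real visiting order will differ. What it does show is that every followed page is run through all the rules, not through "the next step".

```python
import re
from collections import deque

# A made-up site: page -> links found on that page.
SITE = {
    "/start": ["/season/1", "/season/2"],
    "/season/1": ["/season/1/round/1", "/season/2"],
    "/season/2": ["/season/2/round/1"],
    "/season/1/round/1": ["/game/10"],
    "/season/2/round/1": ["/game/20"],
    "/game/10": [],
    "/game/20": [],
}

# (pattern, callback name or None, follow)
RULES = [
    (re.compile(r'^/season/\d+$'), None, True),
    (re.compile(r'^/season/\d+/round/\d+$'), None, True),
    (re.compile(r'^/game/\d+$'), "parse_item", False),
]


def crawl(start):
    """Return (visit order, pages handed to a callback)."""
    queue = deque([(start, None, True)])  # the start page is always followed
    requested = {start}                   # plays the role of the dupe filter
    order, parsed = [], []
    while queue:
        page, callback, follow = queue.popleft()
        order.append(page)
        if callback:
            parsed.append((page, callback))
        if not follow:
            continue
        seen = set()
        for pattern, cb, fl in RULES:     # every rule sees every followed page
            for link in SITE[page]:
                if pattern.search(link) and link not in seen:
                    seen.add(link)
                    if link not in requested:
                        requested.add(link)
                        queue.append((link, cb, fl))
    return order, parsed


order, parsed = crawl("/start")
print(order)
print(parsed)
```

Only the game pages reach a callback, because only the third rule has one, and the game pages are not followed further because that rule does not set follow=True.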

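In summary: the rules are not sequential steps. Every followed response is run through all of the rules, and each extracted link is handled by the first rule that matched it.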

answered Oct 12 '22 16:10

宏杰李

You are right. According to the source code, before returning each response to the callback function, the crawler loops over the rules, starting from the first. Keep this in mind when you write the rules. For example, take the following rules:

rules = (
        Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
     )

The second rule will never be applied since all the links will be extracted by the first rule with parse_item callback. The matches for the second rule will be filtered out as duplicates by the scrapy.dupefilter.RFPDupeFilter. You should use deny for correct matching of links:

rules = (
        # deny is an argument of the link extractor, not of Rule
        Rule(SgmlLinkExtractor(allow=(r'/items',), deny=(r'/items/electronics',)), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item', follow=True),
     )
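Before running the spider, the overlap between two patterns can be checked with plain re. This is a rough sketch: link_allowed is a hypothetical helper, not a Scrapy function, and the two sample URLs are invented. It assumes a link passes when some allow pattern is found anywhere in the URL and no deny pattern is.

```python
import re


def link_allowed(url, allow, deny=()):
    """Hypothetical helper: pass if an allow pattern is found and no deny pattern is."""
    if not any(re.search(p, url) for p in allow):
        return False
    return not any(re.search(p, url) for p in deny)


urls = ["/items/books/42", "/items/electronics/7"]

# Without deny, the broad first pattern claims both URLs.
first_without_deny = [u for u in urls if link_allowed(u, allow=(r'/items',))]

# With deny, the electronics URL is left over for the second rule.
first_with_deny = [u for u in urls
                   if link_allowed(u, allow=(r'/items',), deny=(r'/items/electronics',))]
second = [u for u in urls if link_allowed(u, allow=(r'/items/electronics',))]

print(first_without_deny)
print(first_with_deny)
print(second)
```

The same check can be run against real hrefs copied from the page you are crawling, to see which rule would claim each link.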

answered Oct 12 '22 17:10

user2016508

I would be tempted to use a BaseSpider scraper instead of a crawler. With a BaseSpider you can lay out the intended request routes explicitly, instead of finding ALL hrefs on the page and visiting them based on global rules. Yield Request objects to keep looping through the parent sets of links, and use callbacks to pass the output object all the way to the end.

From your description:

I think the crawler should work something like this: the rules act like a loop. When the first link is matched, the crawler follows it to the "Step 2" page, then to "Step 3", and after that it extracts the data. Then it returns to "Step 1" to match the second link and starts the loop again, until there are no links left in the first step.

A request callback stack like this would suit you very well, since you know the order of the pages and which pages you need to scrape. It also has the added benefit of letting you collect information over multiple pages before returning the output object to be processed.

import re

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class Basketspider(BaseSpider):
    name = "basketsp_test"
    download_delay = 0.5

    def start_requests(self):
        # placeholder: substitute your own Item class
        item = WhateverYourOutputItemIs()
        yield Request("http://www.euroleague.net/main/results/by-date",
                      callback=self.parseSeasonsLinks, meta={'item': item})

    def parseSeasonsLinks(self, response):
        item = response.meta['item']

        hxs = HtmlXPathSelector(response)
        html = hxs.extract()

        # collect the round links found in the page, without duplicates
        # (adjust the pattern to the hrefs that actually appear in the HTML)
        roundLinkPattern = re.compile(r'http://www\.euroleague\.net/main/results/by-date\?gamenumber=\d+&phasetypecode=RS')
        roundLinkList = list()
        for roundLink in re.findall(roundLinkPattern, html):
            if roundLink not in roundLinkList:
                roundLinkList.append(roundLink)

        for roundLink in roundLinkList:
            # if you want to output this info in the final item
            # (note: every request below shares this one item object)
            item['RoundLink'] = roundLink

            # generate a new request for the round page
            yield Request(roundLink, callback=self.parseRoundPage, meta={'item': item})

    def parseRoundPage(self, response):
        item = response.meta['item']
        # do whatever you need to do here: issue more requests if needed,
        # or return the item

        item['Thing'] = 'infoOnPage'
        # ....

        return item

answered Oct 12 '22 16:10

JonM


Tags: python, regex, scrapy, web-crawler