I have hard time to understand scrapy crawl spider rules. I have example that doesn't work as I would like it did, so it can be two things:
OK here it is what I want to do:
I want to write crawl spider that will get all available statistics information from http://www.euroleague.net website. The website page that hosts all the information that I need for the start is here.
Step 1
First step what I am thinking is extract "Seasons" link(s) and fallow it. Here it is HTML/href that I am intending to match (I want to match all links in the "Seasons" section one by one, but I think that it will be easer to have one link as an example):
href="/main/results/by-date?seasoncode=E2001"
And here is a rule/regex that I created for it:
Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True),
Step 2
When I am brought by spider to the web page http://www.euroleague.net/main/results/by-date?seasoncode=E2001 for the second step I want that spider extracted link(s) from section "Regular season". At this case lets say it should be "Round 1". The HTML/href that I am looking for is:
<a href="/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS"
And rule/regex that I constructed would be:
Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True),
Step 3
Now I reached page (http://www.euroleague.net/main/results/by-date?seasoncode=E2001&gamenumber=1&phasetypecode=RS) I am ready to extract links that leads to the pages that has all the information that I need: I am looking for HTML/href:
href="/main/results/showgame?gamenumber=1&phasetypecode=RS&gamecode=4&seasoncode=E2001#!boxscore"
And my regex that has to follow would be:
Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'),
The problem
I think that crawler should work something like this: That rules crawler is something like a loop. When first link is matched the crawler will follow to the "Step 2" page, than to "step 3" and after that it will extract data. After doing that it will return to "step 1" to match second link and start loop again to the point when there is no links in first step.
What I see from terminal it seems that crawler loops in "Step 1". It loops through all "Step 1" links, but doesn't involves "step 2"/"step 3" rules.
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http:// www.euroleague.net/main/results/by-date)
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 00:20:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date)
After it loops through all the "Seasons" links it starts with links that I don't see, in any of three steps that I mentioned:
http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013
And such link structure you can find only if you loop through all the links in "Step 2" without returning to the "Step 1" starting point.
The question would be: How rules work? Is it working step by step like I am intending it should work with this example or every rule has it's own loop and goes from rule to rule only after it's finished looping through the first rule?
That is how I see it. Of course it could be something wrong with my rules/regex and it is very possible.
And here is all what I am getting from the terminal:
scrapy crawl basketsp_test -o item6.xml -t xml
2014-02-28 01:09:20+0200 [scrapy] INFO: Scrapy 0.20.0 started (bot: basketbase)
2014-02-28 01:09:20+0200 [scrapy] DEBUG: Optional features available: ssl, http11, boto, django
2014-02-28 01:09:20+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'basketbase.spiders', 'FEED_FORMAT': 'xml', 'SPIDER_MODULES': ['basketbase.spiders'], 'FEED_URI': 'item6.xml', 'BOT_NAME': 'basketbase'}
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Enabled item pipelines: Basketpipeline3, Basketpipeline1db
2014-02-28 01:09:21+0200 [basketsp_test] INFO: Spider opened
2014-02-28 01:09:21+0200 [basketsp_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-02-28 01:09:21+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-02-28 01:09:21+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date> (referer: None)
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Filtered duplicate request: <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2013> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-02-28 01:09:22+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2000> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2001> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:23+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2002> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2003> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:24+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2004> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:25+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2005> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2006> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:26+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2007> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2008> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:27+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2009> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:28+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2010> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2011> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:29+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?seasoncode=E2012> (referer: http://www.euroleague.net/main/results/by-date)
2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=24&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:30+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=23&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:31+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=22&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=21&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:32+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=20&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:33+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=19&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=18&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:34+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=17&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=16&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:35+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=15&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:36+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=14&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=13&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:37+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=12&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:38+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=11&phasetypecode=TS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=10&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:39+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=9&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=8&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:40+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=7&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:41+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=6&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=5&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:42+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=4&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:43+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=3&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=2&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:44+0200 [basketsp_test] DEBUG: Crawled (200) <GET http://www.euroleague.net/main/results/by-date?gamenumber=1&phasetypecode=RS++++++++&seasoncode=E2013> (referer: http://www.euroleague.net/main/results/by-date?seasoncode=E2013)
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Closing spider (finished)
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 13663,
'downloader/request_count': 39,
'downloader/request_method_count/GET': 39,
'downloader/response_bytes': 527838,
'downloader/response_count': 39,
'downloader/response_status_count/200': 39,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 2, 27, 23, 9, 44, 569579),
'log_count/DEBUG': 46,
'log_count/INFO': 3,
'request_depth_max': 2,
'response_received_count': 39,
'scheduler/dequeued': 39,
'scheduler/dequeued/memory': 39,
'scheduler/enqueued': 39,
'scheduler/enqueued/memory': 39,
'start_time': datetime.datetime(2014, 2, 27, 23, 9, 21, 111255)}
2014-02-28 01:09:44+0200 [basketsp_test] INFO: Spider closed (finished)
And here is a rules part from the crawler:
class Basketspider(CrawlSpider):
name = "basketsp_test"
download_delay = 0.5
allowed_domains = ["www.euroleague.net"]
start_urls = ["http://www.euroleague.net/main/results/by-date"]
rules = (
Rule(SgmlLinkExtractor(allow=('by-date\?seasoncode\=E\d+',)),follow=True),
Rule(SgmlLinkExtractor(allow=('seasoncode\=E\d+\&gamenumber\=\d+\&phasetypecode\=\w+',)),follow=True),
Rule(SgmlLinkExtractor(allow=('gamenumber\=\d+\&phasetypecode\=\w+\&gamecode\=\d+\&seasoncode\=E\d+',)),callback='parse_item'),
)
Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.
Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
The key to running scrapy in a python script is the CrawlerProcess class. This is a class of the Crawler module. It provides the engine to run scrapy within a python script. Within the CrawlerProcess class code, python's twisted framework is imported.
Due to the built-in support for generating feed exports in multiple formats, as well as selecting and extracting data from various sources, the performance of Scrapy can be said to be faster than Beautiful Soup. Working with Beautiful Soup can speed up with the help of Multithreading process.
If you are from china, I have a chinese blog post about this:
别再滥用scrapy CrawlSpider中的follow=True
Let's check out how the rules work under the hood:
def _requests_to_follow(self, response):
seen = set()
for n, rule in enumerate(self._rules):
links = [lnk for lnk in rule.link_extractor.extract_links(response)
if lnk not in seen]
for link in links:
seen.add(link)
r = Request(url=link.url, callback=self._response_downloaded)
yield r
as you can see, when we follow a link, the link in the response is extracted by all the rule using a for loop then add them to a set object.
and all the response will be handled by self._response_downloaded
:
def _response_downloaded(self, response):
rule = self._rules[response.meta['rule']]
return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)
def _parse_response(self, response, callback, cb_kwargs, follow=True):
if callback:
cb_res = callback(response, **cb_kwargs) or ()
cb_res = self.process_results(response, cb_res)
for requests_or_item in iterate_spider_output(cb_res):
yield requests_or_item
# follow will go back to the rules again
if follow and self._follow_links:
for request_or_item in self._requests_to_follow(response):
yield request_or_item
and it goes back to the self._requests_to_follow(response)
again and again.
In summary:
You are right, according to the source code before returning each response to the callback function, the crawler loops over the Rules, starting, from the first. You should have it in mind, when you write the rules. For example the following rules:
rules(
Rule(SgmlLinkExtractor(allow=(r'/items',)), callback='parse_item',follow=True),
Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item',follow=True),
)
The second rule will never be applied since all the links will be extracted by the first rule with parse_item callback. The matches for the second rule will be filtered out as duplicates by the scrapy.dupefilter.RFPDupeFilter
. You should use deny for correct matching of links:
rules(
Rule(SgmlLinkExtractor(allow=(r'/items',)), deny=(r'/items/electronics',), callback='parse_item',follow=True),
Rule(SgmlLinkExtractor(allow=(r'/items/electronics',)), callback='parse_electronic_item',follow=True),
)
I would be tempted to use a BaseSpider scraper instead of a crawler. Using a basespider you can have more of a flow of intended request routes instead of finding ALL hrefs on the page and visiting them based on global rules. Use yield Requests() to continue looping through the parent sets of links and callbacks to pass the output object all the way to the end.
From your description:
I think that crawler should work something like this: That rules crawler is something like a loop. When first link is matched the crawler will follow to the "Step 2" page, than to "step 3" and after that it will extract data. After doing that it will return to "step 1" to match second link and start loop again to the point when there is no links in first step.
A request callback stack like this would suit you very well. Since you know the order of the pages and which pages you need to scrape. This also has the added benefit of being able to collect information over multiple pages before returning the output object to be processed.
class Basketspider(BaseSpider, errorLog):
name = "basketsp_test"
download_delay = 0.5
def start_requests(self):
item = WhateverYourOutputItemIs()
yield Request("http://www.euroleague.net/main/results/by-date", callback=self.parseSeasonsLinks, meta={'item':item})
def parseSeaseonsLinks(self, response):
item = response.meta['item']
hxs = HtmlXPathSelector(response)
html = hxs.extract()
roundLinkList = list()
roundLinkPttern = re.compile(r'http://www\.euroleague\.net/main/results/by-date\?gamenumber=\d+&phasetypecode=RS')
for (roundLink) in re.findall(roundLinkPttern, html):
if roundLink not in roundLinkList:
roundLinkList.append(roundLink)
for i in range(len(roundLinkList)):
#if you wanna output this info in the final item
item['RoundLink'] = roundLinkList[i]
# Generate new request for round page
yield Request(stockpageUrl, callback=self.parseStockItem, meta={'item':item})
def parseRoundPAge(self, response):
item = response.meta['item']
#Do whatever you need to do in here call more requests if needed or return item here
item['Thing'] = 'infoOnPage'
#....
#....
#....
return item
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With