All the examples I have found of Scrapy talk about how to crawl a single page, pages with the same URL schema, or all the pages of a website. I need to crawl a series of pages A, B, C where page A contains the link to B, and so on. For example, the website structure is:
A
----> B
---------> C
D
E
I need to crawl all the C pages, but to get the link to C I first need to crawl A and B. Any hints?
See the Scrapy Request structure; to crawl a chain like this you'll have to use the callback parameter, something like the following:
from scrapy.spider import BaseSpider
from scrapy.http import Request

class MySpider(BaseSpider):
    ...

    # the spider starts here
    def parse(self, response):
        ...
        # A, D and E are crawled in parallel; A -> B -> C are crawled serially
        yield Request(url=<A url>,
                      ...
                      callback=self.parseA)
        yield Request(url=<D url>,
                      ...
                      callback=self.parseD)
        yield Request(url=<E url>,
                      ...
                      callback=self.parseE)

    def parseA(self, response):
        ...
        yield Request(url=<B url>,
                      ...
                      callback=self.parseB)

    def parseB(self, response):
        ...
        yield Request(url=<C url>,
                      ...
                      callback=self.parseC)

    def parseC(self, response):
        ...

    def parseD(self, response):
        ...

    def parseE(self, response):
        ...
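If you also need to carry data scraped on A or B down to C (for instance, to attach it to the item you finally build in parseC), a common pattern is to pass it along in the Request's meta dict. A minimal sketch along the lines of the code above; the field names category and subcategory are made up for illustration, and <B url>/<C url> stand for whatever you extract from the response:

    def parseA(self, response):
        category = ...   # something scraped from page A (hypothetical)
        yield Request(url=<B url>,
                      meta={'category': category},
                      callback=self.parseB)

    def parseB(self, response):
        subcategory = ...   # something scraped from page B (hypothetical)
        yield Request(url=<C url>,
                      meta={'category': response.meta['category'],
                            'subcategory': subcategory},
                      callback=self.parseC)

    def parseC(self, response):
        # both values carried down from A and B are available here
        category = response.meta['category']
        subcategory = response.meta['subcategory']
        ...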
Here is an example spider I wrote for a project of mine:
from scrapy.contrib.spiders import CrawlSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
from yoMamaSpider.items import JokeItem
from yoMamaSpider.striputils import stripcats, stripjokes
import re

class Jokes4UsSpider(CrawlSpider):
    name = 'jokes4us'
    allowed_domains = ['jokes4us.com']
    start_urls = ["http://www.jokes4us.com/yomamajokes/"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a')
        for link in links:
            url = ''.join(link.select('./@href').extract())
            relevant_urls = re.compile(
                'http://www\.jokes4us\.com/yomamajokes/yomamas([a-zA-Z]+)')
            if relevant_urls.match(url):
                yield Request(url, callback=self.parse_page)

    def parse_page(self, response):
        hxs = HtmlXPathSelector(response)
        categories = stripcats(hxs.select('//title/text()').extract())
        joke_area = hxs.select('//p/text()').extract()
        for joke in joke_area:
            joke = stripjokes(joke)
            if len(joke) > 15:
                yield JokeItem(joke=joke, categories=categories)
I think the parse method is what you are after: it looks at every link on the start_urls page, then uses a regex to decide whether it is a relevant_url (i.e. a URL I would like to scrape). If it is relevant, it scrapes that page with yield Request(url, callback=self.parse_page), which calls the parse_page method.
Is this the kind of thing you are after?
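The JokeItem class imported from yoMamaSpider.items isn't shown above; judging from how it is constructed in parse_page, it is presumably just an ordinary Scrapy Item with a joke field and a categories field, roughly like this (a guess at its shape, not the author's actual definition):

from scrapy.item import Item, Field

class JokeItem(Item):
    joke = Field()        # the joke text, after cleanup by stripjokes
    categories = Field()  # the categories derived from the page title

You would then run the spider with something like scrapy crawl jokes4us -o jokes.json to dump the collected items to a JSON file (the output filename here is arbitrary).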