The script (below) from this tutorial contains two start_urls.
from scrapy.spider import Spider
from scrapy.selector import Selector

from dirbot.items import Website


class DmozSpider(Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        """
        The lines below is a spider contract. For more info see:
        http://doc.scrapy.org/en/latest/topics/contracts.html

        @url http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/
        @scrapes name
        """
        sel = Selector(response)
        sites = sel.xpath('//ul[@class="directory-url"]/li')
        items = []

        for site in sites:
            item = Website()
            item['name'] = site.xpath('a/text()').extract()
            item['url'] = site.xpath('a/@href').extract()
            item['description'] = site.xpath('text()').re('-\s[^\n]*\\r')
            items.append(item)

        return items
But why does it scrape only these two web pages? I see allowed_domains = ["dmoz.org"], but these two pages also contain links to other pages that are within the dmoz.org domain. Why doesn't it scrape them too?
The start_urls class attribute contains the start URLs - nothing more. If you have extracted URLs of other pages you want to scrape, yield the corresponding requests from the parse callback with [another] callback:
import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider
from scrapy import log


class Spider(BaseSpider):
    name = 'my_spider'
    start_urls = [
        'http://www.domain.com/'
    ]
    allowed_domains = ['domain.com']

    def parse(self, response):
        '''Parse main page and extract categories links.'''
        hxs = HtmlXPathSelector(response)
        urls = hxs.select("//*[@id='tSubmenuContent']/a[position()>1]/@href").extract()
        for url in urls:
            url = urlparse.urljoin(response.url, url)
            self.log('Found category url: %s' % url)
            yield Request(url, callback=self.parseCategory)

    def parseCategory(self, response):
        '''Parse category page and extract links of the items.'''
        hxs = HtmlXPathSelector(response)
        links = hxs.select("//*[@id='_list']//td[@class='tListDesc']/a/@href").extract()
        for link in links:
            itemLink = urlparse.urljoin(response.url, link)
            self.log('Found item link: %s' % itemLink, log.DEBUG)
            yield Request(itemLink, callback=self.parseItem)

    def parseItem(self, response):
        ...
If you still want to customize how the start requests are created, override the BaseSpider.start_requests() method.
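As a rough sketch of such an override (the URLs, spider name, and callback body here are placeholders, not taken from the question):

from scrapy.http import Request
from scrapy.spider import BaseSpider


class MyStartRequestsSpider(BaseSpider):
    name = 'my_custom_start_spider'
    allowed_domains = ['domain.com']

    def start_requests(self):
        # Build the initial requests yourself instead of relying on start_urls.
        # These seed URLs are placeholders for illustration only.
        for url in ['http://www.domain.com/page1', 'http://www.domain.com/page2']:
            yield Request(url, callback=self.parse)

    def parse(self, response):
        # Process each seed page and yield further Requests or items as needed.
        pass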
start_urls contains the links from which the spider starts crawling. If you want to crawl recursively, you should use CrawlSpider and define rules for that. See http://doc.scrapy.org/en/latest/topics/spiders.html for an example.
The class does not have a rules property. Have a look at http://readthedocs.org/docs/scrapy/en/latest/intro/overview.html and search for "rules" to find an example.
If you use BaseSpider, then inside the callback you have to extract the URLs you want yourself and return Request objects.
If you use CrawlSpider, link extraction is taken care of by the rules and the SgmlLinkExtractor associated with them.
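A minimal sketch of that second approach, assuming the old CrawlSpider/SgmlLinkExtractor API; the allow patterns and the callback body are placeholders and have not been verified against dmoz.org's actual markup:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MyCrawlSpider(CrawlSpider):
    name = 'my_crawl_spider'
    allowed_domains = ['dmoz.org']
    start_urls = ['http://www.dmoz.org/Computers/Programming/Languages/Python/']

    # Each Rule tells the spider which links to follow and which callback to use.
    rules = (
        # Keep following category pages (placeholder pattern), without a callback.
        Rule(SgmlLinkExtractor(allow=(r'/Computers/Programming/', )), follow=True),
        # Pages matching this placeholder pattern are handed to parse_item.
        Rule(SgmlLinkExtractor(allow=(r'/Books/', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # Extract your item fields here; left empty in this sketch.
        pass

Note that a CrawlSpider should not override parse(), since CrawlSpider uses it internally to apply the rules.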