I'm attempting to write a spider that recursively scrapes an entire site, using scrapy.
However, while it seems to scrape the first page fine and finds the links on that page, it doesn't follow them and scrape those pages, which is what I need.
I've created a scrapy project and started writing a spider that looks like this:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urlparse import urljoin


class EventsSpider(scrapy.Spider):
    name = "events"
    allowed_domains = ["www.foo.bar/"]
    start_urls = (
        'http://www.foo.bar/events/',
    )

    rules = (
        Rule(LinkExtractor(), callback="parse", follow=True),
    )

    def parse(self, response):
        yield {
            'url': response.url,
            'language': response.xpath('//meta[@name=\'Language\']/@content').extract(),
            'description': response.xpath('//meta[@name=\'Description\']/@content').extract(),
        }
        for url in response.xpath('//a/@href').extract():
            if url and not url.startswith('#'):
                self.logger.debug(urljoin(response.url, url))
                scrapy.http.Request(urljoin(response.url, url))
Then, when I run the spider with scrapy crawl events -o events.json, I get the following console output:
2016-05-16 09:50:04 [scrapy] INFO: Spider closed (finished)
PS C:\Projects\foo\src\Scrapy> scrapy crawl events -o .\events.json
2016-05-16 09:54:36 [scrapy] INFO: Scrapy 1.1.0 started (bot: foo)
2016-05-16 09:54:36 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'foo.spiders', 'FEED_URI': '.\\events.json
', 'SPIDER_MODULES': ['foo.spiders'], 'BOT_NAME': 'foo', 'ROBOTSTXT_OBEY': True, 'FEED_FORMAT': 'json'}
2016-05-16 09:54:36 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-05-16 09:54:36 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.chunked.ChunkedTransferMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2016-05-16 09:54:36 [scrapy] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2016-05-16 09:54:36 [scrapy] INFO: Enabled item pipelines:
[]
2016-05-16 09:54:36 [scrapy] INFO: Spider opened
2016-05-16 09:54:36 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-05-16 09:54:36 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-05-16 09:54:36 [scrapy] DEBUG: Crawled (200) <GET http://www.foo.co.uk/robots.txt> (referer: None)
2016-05-16 09:54:37 [scrapy] DEBUG: Crawled (200) <GET http://www.foo.co.uk/events/> (referer: None)
2016-05-16 09:54:37 [scrapy] DEBUG: Scraped from <200 http://www.foo.co.uk/events/>
{'description': [], 'language': [u'en_UK'], 'url': 'http://www.foo.co.uk/events/'}
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/default.aspx
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/page/a-z/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/thing/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/other-thing/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/foo-about-us/
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/contactus
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/bar
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/event
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/super-cool-party
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/another-event
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/more-events
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/tps-report-convention
...
more links
...
2016-05-16 09:54:37 [events] DEBUG: http://www.foo.co.uk/events/tps-report-convention-two-the-return
2016-05-16 09:54:37 [scrapy] INFO: Closing spider (finished)
2016-05-16 09:54:37 [scrapy] INFO: Stored json feed (1 items) in: .\events.json
2016-05-16 09:54:37 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 524,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 6187,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2016, 5, 16, 8, 54, 37, 271000),
'item_scraped_count': 1,
'log_count/DEBUG': 80,
'log_count/INFO': 8,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2016, 5, 16, 8, 54, 36, 913000)}
2016-05-16 09:54:37 [scrapy] INFO: Spider closed (finished)
And in the events.json file produced by the crawl, the only page that seems to have been scraped is the start URL specified at the top of the script, when really I need every page matching /events/ to be scraped.
I'm not sure how to proceed on this, so any help on the matter would be greatly appreciated.
Thanks.
You should create an Item object, and actually use the CrawlSpider that you imported: the rules attribute only takes effect when the spider subclasses CrawlSpider, and the Request objects you build in your loop are never yielded, so Scrapy never schedules them. I made a few changes to your code; try this instead.
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from urlparse import urljoin

# todo: define this item in your project's items.py (see below)
from your_project.items import YourItem


class EventsSpider(CrawlSpider):
    name = "events"
    allowed_domains = ["foo.bar"]
    start_urls = [
        'http://www.foo.bar/events/',
    ]

    rules = (
        # follow every extracted link and pass each response to parse_item
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = YourItem()
        item['url'] = response.url
        item['language'] = response.xpath('//meta[@name="Language"]/@content').extract()
        item['description'] = response.xpath('//meta[@name="Description"]/@content').extract()
        yield item

        # This loop is optional, because the Rule above already follows every
        # link. If you keep it, the requests must be yielded and given a
        # callback, otherwise nothing happens with them:
        for url in response.xpath('//a/@href').extract():
            if url and not url.startswith('#'):
                self.logger.debug(urljoin(response.url, url))
                yield Request(urljoin(response.url, url), callback=self.parse_item)
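As a minimal sketch of the item the spider references (the module path your_project.items and the field names simply mirror the code above; rename them to match your project):

# your_project/items.py
import scrapy


class YourItem(scrapy.Item):
    # one Field per key assigned in parse_item
    url = scrapy.Field()
    language = scrapy.Field()
    description = scrapy.Field()

Also, since you only want pages under /events/, you could restrict the extractor with LinkExtractor(allow=r'/events/'). Note that this restricts which links are followed as well as which are scraped, so if /events/ pages are only reachable through other parts of the site, add a second plain Rule(LinkExtractor(), follow=True) without a callback to keep the crawl going.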