I have a two-part question.
First, I'm writing a web scraper based on the CrawlSpider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly hundreds of thousands) of records. These records are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the HTML. What I'm wondering is: what methods exist to prevent my spider from overloading the site? Is there possibly a way to do things incrementally or put a pause in between different requests?
Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Any advice or resources would be greatly appreciated.
Is there possibly a way to do things incrementally
I'm using Scrapy's caching ability to scrape the site incrementally:
HTTPCACHE_ENABLED = True
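If it helps, the related cache settings can sit alongside it in settings.py; the directory and expiration values shown here are, as far as I know, just the defaults:

    HTTPCACHE_ENABLED = True
    HTTPCACHE_DIR = 'httpcache'        # cache is stored under the project's .scrapy directory
    HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached pages never expire

On a re-run, pages already in the cache are served from disk instead of being requested again, so an interrupted crawl doesn't hit the site twice for the same URLs.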
Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
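That works by pointing the crawl at a job directory; something along these lines (the spider name and directory are placeholders):

    scrapy crawl somespider -s JOBDIR=crawls/somespider-1

You can then stop the crawl (press Ctrl-C once and let it shut down cleanly) and run the same command again later to resume where it left off.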
or put a pause in between different requests?
Check these settings:
DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY
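For example, a polite configuration in settings.py could look something like this (the 2-second delay is just an illustrative value):

    DOWNLOAD_DELAY = 2                 # wait roughly 2 seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True    # vary the delay between 0.5x and 1.5x DOWNLOAD_DELAY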
is there a method with Scrapy to test a crawler without placing undue stress on a site?
You can try out and debug your code in the Scrapy shell.
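For instance, you can point the shell at a single page and experiment with your selectors there before running the full crawl (the URL and XPath below are placeholders, and response.xpath assumes a reasonably recent Scrapy version):

    $ scrapy shell "http://example.com/records/some-record"
    >>> response.xpath("//h1/text()").extract()   # try selectors against just this one page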
I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?
Also, you can call scrapy.shell.inspect_response at any time in your spider.
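As a rough sketch (the spider name, domain, and link patterns are placeholders, and the import paths assume a recent Scrapy version), dropping inspect_response into your record callback pauses the crawl on the first matching page and opens a shell with that response loaded:

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor
    from scrapy.shell import inspect_response


    class RecordSpider(CrawlSpider):
        name = "records"
        allowed_domains = ["example.com"]
        start_urls = ["http://example.com/"]

        rules = (
            # Send anything that looks like a record page to parse_record.
            Rule(LinkExtractor(allow=r"/record/"), callback="parse_record"),
            # Keep following category/listing pages to reach the records.
            Rule(LinkExtractor(allow=r"/category/"), follow=True),
        )

        def parse_record(self, response):
            # Opens an interactive shell with this response loaded, so you can
            # test your extraction on the first record page the crawl reaches,
            # then exit the shell and Ctrl-C the run instead of crawling everything.
            inspect_response(response, self)

If you want the spider to stop on its own instead, raising scrapy.exceptions.CloseSpider from the callback will shut the crawl down after that page.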
Any advice or resources would be greatly appreciated.
The Scrapy documentation is the best resource.