
Being a good citizen and web-scraping

I have a two-part question.

First, I'm writing a web scraper based on the CrawlSpider spider in Scrapy. I'm aiming to scrape a website that has many thousands (possibly into the hundreds of thousands) of records, which are buried 2-3 layers down from the start page. So basically I have the spider start on a certain page, crawl until it finds a specific type of record, and then parse the HTML. What I'm wondering is what methods exist to prevent my spider from overloading the site. Is there possibly a way to do things incrementally, or to put a pause in between different requests?

Second, and related, is there a method with Scrapy to test a crawler without placing undue stress on a site? I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Any advice or resources would be greatly appreciated.

asked Dec 17 '11 by user1074057


1 Answer

Is there possibly a way to do things incrementally

I'm using Scrapy's caching ability to scrape the site incrementally:

HTTPCACHE_ENABLED = True
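
For example, a minimal sketch of the cache-related settings in settings.py (HTTPCACHE_DIR and HTTPCACHE_EXPIRATION_SECS are standard Scrapy settings; the values below are only illustrative):

# settings.py -- illustrative values
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'        # cached responses live under the project's .scrapy directory
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached pages never expire, so re-runs reuse them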

Or you can use the new 0.14 feature, Jobs: pausing and resuming crawls.
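
Roughly, you pass a job directory on the command line and the crawl state is persisted there, so the crawl can be stopped and resumed later (the spider name and directory below are just placeholders):

# start a crawl and persist its state; stop it gracefully with a single Ctrl-C,
# then run the same command again later to resume where it left off
scrapy crawl myspider -s JOBDIR=crawls/myspider-run1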

or put a pause in between different requests?

check these settings:

DOWNLOAD_DELAY    
RANDOMIZE_DOWNLOAD_DELAY
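
For example, in settings.py (the numbers are only illustrative):

DOWNLOAD_DELAY = 2                 # wait roughly 2 seconds between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # vary the actual wait between 0.5x and 1.5x of DOWNLOAD_DELAY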

is there a method with Scrapy to test a crawler without placing undue stress on a site?

You can test and debug your code in the Scrapy shell.
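
For example, you can fetch a single page and experiment with it interactively, without running the whole crawl (the URL is just a placeholder):

# opens an interactive shell with that one response already fetched;
# inside it you get objects like `response` and helpers like view(response)
scrapy shell 'http://example.com/some/record/page'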

I know you can kill the program while it runs, but is there a way to make the script stop after hitting something like the first page that has the information I want to scrape?

Also, you can call scrapy.shell.inspect_response at any time in your spider.
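
A minimal sketch of what that looks like inside a spider callback (parse_record is a made-up name; in recent Scrapy versions inspect_response takes both the response and the spider):

from scrapy.shell import inspect_response

def parse_record(self, response):
    # drops you into an interactive shell for this one response,
    # so you can examine it and decide whether to keep crawling
    inspect_response(response, self)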

Any advice or resources would be greatly appreciated.

The Scrapy documentation is the best resource.

answered Oct 13 '22 by reclosedev