I'm using Scrapy on my dedicated server, and I would like to know how to get the best performance out of my crawler.
Here are my custom settings:
custom_settings = {
    'RETRY_ENABLED': True,
    'DEPTH_LIMIT': 0,
    'DEPTH_PRIORITY': 1,
    'LOG_ENABLED': False,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
    'CONCURRENT_REQUESTS': 64,
}
I currently crawl around 200 links/minute.
Server:
RAM: 32 GB DDR4 ECC 2133 MHz
CPU: 4c/8t, 2.2 / 2.6 GHz
1) Use Scrapyd to run your spiders.
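For example, once a project is deployed you can schedule runs through Scrapyd's HTTP JSON API. A minimal sketch in Python (assuming Scrapyd listens on its default port 6800; "myproject" and "myspider" are placeholder names):

# Schedule a spider run via Scrapyd's schedule.json endpoint.
# Assumes Scrapyd runs on the default port 6800 and that a project
# named "myproject" with a spider named "myspider" has been deployed.
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "myspider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}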
2) The default duplicate filter that Scrapy uses to skip already-visited URLs keeps a set of request fingerprints: SHA1 hashes, i.e. 40-character hex strings that each take 77 bytes as Python 2.7 string objects. Say you have to scrape a site with 2M pages; the duplicates filter may then grow to 2M * 77 B ≈ 154 MB per crawler. To scrape 300 such domains simultaneously, you would need about 300 * 154 MB ≈ 46 GB of memory. Fortunately there is another way: a Bloom filter (see the sketch below).
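A rough sketch of a Bloom-filter-based duplicate filter (this is not built into Scrapy; it assumes the third-party pybloom_live package, and BloomDupeFilter is an illustrative name):

# bloom_dupefilter.py -- illustrative sketch, assumes pybloom_live is installed.
from pybloom_live import ScalableBloomFilter
from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

class BloomDupeFilter(BaseDupeFilter):
    def __init__(self):
        # Grows as needed; error_rate is the accepted false-positive rate.
        self.seen = ScalableBloomFilter(initial_capacity=1000000,
                                        error_rate=0.0001)

    def request_seen(self, request):
        fp = request_fingerprint(request)
        # add() returns True if the fingerprint was (probably) already present.
        return self.seen.add(fp)

# Enable it in settings.py (the path is a placeholder):
# DUPEFILTER_CLASS = 'myproject.bloom_dupefilter.BloomDupeFilter'

The trade-off is a small false-positive rate: a Bloom filter will occasionally treat an unseen URL as already visited, which is usually acceptable for broad crawls.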
3) In production I use Scrapyd to run Scrapy spiders in a distributed environment.
4) IMHO, I would suggest using several smaller commodity machines, each with its own Scrapyd instance running spiders, instead of one big machine.
5) Distributed crawlers - I have not used them personally.
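(For reference, scrapy-redis is one frequently mentioned option for sharing the request queue and duplicate filter between machines; a minimal settings sketch, assuming that package and a reachable Redis server:)

# Illustrative settings for a scrapy-redis based distributed crawl.
# Assumes the scrapy-redis package is installed and Redis runs locally.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
SCHEDULER_PERSIST = True                  # keep the queue between runs
REDIS_URL = "redis://localhost:6379"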
6) Use Scrapy's telnet console to track down memory usage (the log tells you where it listens, e.g.: 2015-07-20 20:32:11-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023):
telnet localhost 6023
prefs()
Live References
# scrapy class         count   age of oldest
HtmlResponse               3   oldest: 5s ago
CraigslistItem           100   oldest: 5s ago
DmozItem                   1   oldest: 0s ago
DmozSpider                 1   oldest: 6s ago
CraigslistSpider           1   oldest: 5s ago
Request                 3000   oldest: 705s ago
Selector                  14   oldest: 5s ago
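In the output above, 3000 live Requests with the oldest 705s old suggests the scheduler queue is what holds most of the memory, which is expected for a broad crawl. You can also let Scrapy's built-in memusage extension warn you or stop the crawl when memory grows too far; a sketch of the relevant settings (the limit values are just examples):

MEMUSAGE_ENABLED = True
MEMUSAGE_WARNING_MB = 1536   # log a warning above this limit
MEMUSAGE_LIMIT_MB = 2048     # close the spider above this limit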