
Best performance for Scrapy

I'm using Scrapy on my dedicated server, and I would like to know how to get the best performance out of my crawler.

Here are my custom settings:

custom_settings = {
    'RETRY_ENABLED': True,
    'DEPTH_LIMIT': 0,
    'DEPTH_PRIORITY': 1,
    'LOG_ENABLED': False,
    'CONCURRENT_REQUESTS_PER_DOMAIN': 32,
    'CONCURRENT_REQUESTS': 64,
}

I currently crawl around 200 links per minute.

Server:

RAM: 32 GB, DDR4 ECC 2133 MHz
CPU: 4c/8t, 2.2 / 2.6 GHz
Pixel asked Mar 13 '23 22:03


1 Answer

1) Use Scrapyd to run your spiders.

2) The default duplicate filter, which Scrapy uses to skip URLs it has already visited, keeps a set of URL fingerprints in memory. Each fingerprint is a SHA-1 hash written as 40 hex characters, which takes about 77 bytes as a string in Python 2.7. Say you have to scrape a site with 2M pages: the duplicate filter may then grow to 2M * 77 B = 154 MB per crawler. To scrape 300 such domains simultaneously, you would need 300 * 154 MB ≈ 46 GB of memory. Fortunately there is another way: a Bloom filter.

3) In production I use Scrapyd and Scrapy spiders in a distributed environment.

4) IMHO, it is better to use several smaller commodity machines, each running a Scrapyd instance and its spiders, than one giant machine.

5) Distributed crawlers: I have not used them personally.

6) Use Scrapy's telnet console to debug memory usage. The log shows where it is listening (e.g. 2015-07-20 20:32:11-0400 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023). Connect to it and call prefs() to list the live objects:

telnet localhost 6023

prefs()

Live References

# scrapy class           Live objects   Age of oldest
HtmlResponse                        3   oldest:   5s ago
CraigslistItem                    100   oldest:   5s ago
DmozItem                            1   oldest:   0s ago
DmozSpider                          1   oldest:   6s ago
CraigslistSpider                    1   oldest:   5s ago
Request                          3000   oldest: 705s ago
Selector                           14   oldest:   5s ago
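
To make point 2 concrete, here is a minimal sketch of the memory arithmetic and of a Bloom filter holding the same number of URLs. It is pure Python and is not Scrapy's own code: the BloomFilter class, its capacity and error_rate parameters, and the example URLs are assumptions made up for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative fixed-size Bloom filter for URL strings (hypothetical helper)."""

    def __init__(self, capacity, error_rate=0.001):
        # Standard sizing formulas: m = -n * ln(p) / (ln 2)^2 and k = (m / n) * ln 2
        self.num_bits = math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2)
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all bit positions from a single SHA-1 digest.
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Estimate from the answer: 2M fingerprints at about 77 bytes each.
set_bytes = 2_000_000 * 77  # 154,000,000 bytes, i.e. 154 MB per crawler

# Same number of URLs in a Bloom filter with a 0.1% false-positive target.
bloom = BloomFilter(capacity=2_000_000, error_rate=0.001)
bloom_bytes = len(bloom.bits)  # a few MB instead of 154 MB

bloom.add("http://example.com/page-1")
print("http://example.com/page-1" in bloom)
print(set_bytes, bloom_bytes)
```

The trade-off is that a Bloom filter can return false positives, so a small fraction of unseen URLs would be treated as already visited and skipped. Using something like this inside Scrapy would mean writing a custom duplicate filter and pointing the DUPEFILTER_CLASS setting at it, which is not shown here.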
backtrack answered Mar 15 '23 12:03