Background - TLDR: I have a memory leak in my project
I've spent a few days going through the Scrapy memory leak docs and can't find the problem. I'm developing a medium-sized Scrapy project, ~40k requests per day.
I am hosting this using scrapinghub's scheduled runs.
On scrapinghub, for $9 per month, you are essentially given 1 VM, with 1GB of RAM, to run your crawlers.
I've developed the crawler locally and uploaded it to scrapinghub; the only problem is that towards the end of the run, I exceed the memory limit.
Locally, setting CONCURRENT_REQUESTS=16 works fine, but on scrapinghub it exceeds the memory at about the 50% point. When I set CONCURRENT_REQUESTS=4, I exceed the memory at about the 95% point, so reducing it to 2 should fix the problem, but then my crawler becomes too slow.
The alternative solution is paying for 2 VMs to increase the RAM, but I have a feeling that the way I've set up my crawler is causing the memory leak.
For this example, the project will scrape an online retailer.
When run locally, my memusage/max is 2.7 GB with CONCURRENT_REQUESTS=16.
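For reference, here is a minimal settings.py sketch of the knobs discussed above; the values are illustrative, not a fix, and MEMUSAGE_LIMIT_MB simply mirrors the 1 GB cap of the scrapinghub container (Scrapy's MemoryUsage extension is what reports memusage/max):

import os

# settings.py (sketch; illustrative values)
CONCURRENT_REQUESTS = 4        # lower concurrency means fewer in-flight responses in RAM
DOWNLOAD_DELAY = 0.05          # mild throttling, as in the run log below

MEMUSAGE_ENABLED = True        # MemoryUsage extension reports memusage/max
MEMUSAGE_LIMIT_MB = 1024       # abort the crawl if RSS exceeds ~1 GB (mirrors the VM cap)
MEMUSAGE_WARNING_MB = 900      # log a warning before hitting the hard limit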
I will now run through my Scrapy structure:
import json

class Pipeline(object):
    def process_item(self, item, spider):
        item['stock_jsons'] = json.loads(item['stock_jsons'])['subProducts']
        return item
class mainItem(scrapy.Item):
    date = scrapy.Field()
    url = scrapy.Field()
    active_col_num = scrapy.Field()
    all_col_nums = scrapy.Field()
    old_price = scrapy.Field()
    current_price = scrapy.Field()
    image_urls_full = scrapy.Field()
    stock_jsons = scrapy.Field()

class URLItem(scrapy.Item):
    urls = scrapy.Field()
import requests
import scrapy
from tqdm import tqdm

class ProductSpider(scrapy.Spider):
    name = 'product'

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        page = requests.get('www.example.com', headers=headers)
        self.num_pages = # gets the number of pages to search

    def start_requests(self):
        for page in tqdm(range(1, self.num_pages + 1)):
            url = f'www.example.com/page={page}'
            yield scrapy.Request(url=url, headers=headers, callback=self.prod_url)

    def prod_url(self, response):
        urls_item = URLItem()
        extracted_urls = response.xpath(####).extract()  # Gets URLs to follow
        urls_item['urls'] = [# Get a list of urls]
        for url in urls_item['urls']:
            yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):  # Parse the main product page
        item = mainItem()
        item['date'] = DATETIME_VAR
        item['url'] = response.url
        item['active_col_num'] = XXX
        item['all_col_nums'] = XXX
        item['old_price'] = XXX
        item['current_price'] = XXX
        item['image_urls_full'] = XXX
        try:
            new_url = 'www.exampleAPI.com/' + item['active_col_num']
        except TypeError:
            new_url = 'www.exampleAPI.com/{dummy_number}'
        yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item})

    def parse_attr(self, response):
        ## This calls an API Step 5
        item = response.meta['item']
        item['stock_jsons'] = response.text
        yield item
What I've tried so far
psutil, which hasn't helped much.
trackref.print_live_refs() returns the following at the end of the run:
HtmlResponse 31 oldest: 3s ago
mainItem 18 oldest: 5s ago
ProductSpider 1 oldest: 3321s ago
Request 43 oldest: 105s ago
Selector 16 oldest: 3s ago
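For completeness, this is roughly how that table was produced and how it can be dug into further; print_live_refs and get_oldest are part of scrapy.utils.trackref and are meant to be used from the Telnet console (telnet localhost 6023) or a debugging hook:

from scrapy.utils.trackref import print_live_refs, get_oldest

print_live_refs()               # the table shown above
oldest = get_oldest('Request')  # oldest live Request object, or None
if oldest is not None:
    print(oldest.url)           # which request has been alive the longest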
QUESTIONS
Please let me know if there is any more information required
Additional Information Requested
Please let me know if the full output from scrapinghub is required; I think it should be the same, but the finish reason message there is memory exceeded.
1. Log lines from the start (from INFO: Scrapy xxx started to Spider opened):
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Scrapy 2.3.0 started (bot: PLT)
2020-09-17 11:54:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (v3.7.4:e09359112e, Jul 8 2019, 14:54:52) - [Clang 6.0 (clang-600.0.57)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 3.1, Platform Darwin-18.7.0-x86_64-i386-64bit
2020-09-17 11:54:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'PLT',
'CONCURRENT_REQUESTS': 14,
'CONCURRENT_REQUESTS_PER_DOMAIN': 14,
'DOWNLOAD_DELAY': 0.05,
'LOG_LEVEL': 'INFO',
'NEWSPIDER_MODULE': 'PLT.spiders',
'SPIDER_MODULES': ['PLT.spiders']}
2020-09-17 11:54:11 [scrapy.extensions.telnet] INFO: Telnet Password: # blocked
2020-09-17 11:54:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
=======
17_Sep_2020_11_54_12
=======
2020-09-17 11:54:12 [scrapy.middleware] INFO: Enabled item pipelines:
['PLT.pipelines.PltPipeline']
2020-09-17 11:54:12 [scrapy.core.engine] INFO: Spider opened
2. Ending log lines (from INFO: Dumping Scrapy stats to the end):
2020-09-17 11:16:43 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 15842233,
'downloader/request_count': 42031,
'downloader/request_method_count/GET': 42031,
'downloader/response_bytes': 1108804016,
'downloader/response_count': 42031,
'downloader/response_status_count/200': 41999,
'downloader/response_status_count/403': 9,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/504': 22,
'dupefilter/filtered': 110,
'elapsed_time_seconds': 3325.171148,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 17, 10, 16, 43, 258108),
'httperror/response_ignored_count': 10,
'httperror/response_ignored_status_count/403': 9,
'httperror/response_ignored_status_count/404': 1,
'item_scraped_count': 20769,
'log_count/INFO': 75,
'memusage/max': 2707484672,
'memusage/startup': 100196352,
'request_depth_max': 2,
'response_received_count': 42009,
'retry/count': 22,
'retry/reason_count/504 Gateway Time-out': 22,
'scheduler/dequeued': 42031,
'scheduler/dequeued/memory': 42031,
'scheduler/enqueued': 42031,
'scheduler/enqueued/memory': 42031,
'start_time': datetime.datetime(2020, 9, 17, 9, 21, 18, 86960)}
2020-09-17 11:16:43 [scrapy.core.engine] INFO: Spider closed (finished)
The site I am scraping has around 20k products and shows 48 per page. So the spider goes to the site, sees 20103 products, then divides by 48 (then math.ceil) to get the number of pages.
downloader/request_bytes 2945159
downloader/request_count 16518
downloader/request_method_count/GET 16518
downloader/response_bytes 3366280619
downloader/response_count 16516
downloader/response_status_count/200 16513
downloader/response_status_count/404 3
dupefilter/filtered 7
elapsed_time_seconds 4805.867308
finish_reason memusage_exceeded
finish_time 1600567332341
httperror/response_ignored_count 3
httperror/response_ignored_status_count/404 3
item_scraped_count 8156
log_count/ERROR 1
log_count/INFO 94
memusage/limit_reached 1
memusage/max 1074937856
memusage/startup 109555712
request_depth_max 2
response_received_count 16516
retry/count 2
retry/reason_count/504 Gateway Time-out 2
scheduler/dequeued 16518
scheduler/dequeued/disk 16518
scheduler/enqueued 17280
scheduler/enqueued/disk 17280
start_time 1600562526474
To help debug memory leaks, Scrapy provides a built-in mechanism for tracking object references called trackref, and you can also use a third-party library called muppy for more advanced memory debugging. Both mechanisms must be used from the Telnet Console.
By default Scrapy keeps the request queue in memory; it includes Request objects and all objects referenced in Request attributes (e.g. in cb_kwargs and meta). While not necessarily a leak, this can take a lot of memory.
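One way to keep that queue out of RAM is a disk-backed scheduler queue via the JOBDIR setting; a minimal sketch follows (the directory name is hypothetical). The scrapinghub stats above already show scheduler/enqueued/disk, so a disk queue was in use there:

# settings.py: serialize pending requests to disk instead of holding them all in RAM.
# This also makes the crawl pausable/resumable.
JOBDIR = 'crawls/product-run'

The same can be set per run with scrapy crawl product -s JOBDIR=crawls/product-run.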
1. Scheduler queue / active requests
With self.num_pages = 418, these lines will create 418 Request objects (including asking the OS for memory to hold 418 objects) and put them into the scheduler queue:
for page in tqdm(range(1, self.num_pages + 1)):
    url = f'www.example.com/page={page}'
    yield scrapy.Request(url=url, headers=headers, callback=self.prod_url)
each "page" request generate 48 new requests.
each "product page" request generate 1 "api_call" request
each "api_call" request returns item object.
As all requests have equal priority - on the worst case application will require memory to hold ~20000 request/response objects in RAM at once.
To avoid this, a priority parameter can be added to scrapy.Request. You will probably need to change the spider configuration to something like this:
def start_requests(self):
    yield scrapy.Request(url='www.example.com/page=1', headers=headers, callback=self.prod_url)

def prod_url(self, response):
    # get the number of the next page
    next_page_number = int(response.url.split("/page=")[-1]) + 1
    #...
    for url in urls_item['urls']:
        yield scrapy.Request(url=url, headers=headers, callback=self.parse, priority=1)
    if next_page_number <= self.num_pages:
        yield scrapy.Request(url=f"www.example.com/page={next_page_number}",
                             headers=headers, callback=self.prod_url)

def parse(self, response):  # Parse the main product page
    #....
    try:
        new_url = 'www.exampleAPI.com/' + item['active_col_num']
    except TypeError:
        new_url = 'www.exampleAPI.com/{dummy_number}'
    yield scrapy.Request(new_url, callback=self.parse_attr, meta={'item': item}, priority=2)
With this spider configuration, the spider will only process the product pages of the next listing page once it has finished processing the products from the previous pages, so your application will not build up a long queue of requests/responses.
2. HTTP compression
A lot of websites compress HTML to reduce traffic load. For example, Amazon compresses its product pages using gzip. The average size of the compressed HTML of an Amazon product page is ~250 KB, while the uncompressed HTML can exceed ~1.5 MB.
If your website uses compression and the uncompressed response sizes are similar to those of Amazon product pages, the app will need a lot of memory to hold both the compressed and uncompressed response bodies.
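A quick way to check whether the target serves compressed responses is to look at the Content-Encoding header outside of Scrapy; a small sketch (www.example.com stands in for the real site, as in the question):

import requests

r = requests.get('https://www.example.com/page=1',
                 headers={'Accept-Encoding': 'gzip, deflate'})
print(r.headers.get('Content-Encoding'))   # e.g. 'gzip' if compression is enabled
print(len(r.content))                      # size of the decompressed body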
Also, the DownloaderStats middleware that populates the downloader/response_bytes stat will not count the size of the uncompressed responses, because its process_response method is called before the process_response method of HttpCompressionMiddleware.
To check this, you will need to change the priority of the DownloaderStats middleware by adding this to the settings:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 50,
}
In this case:
downloader/request_bytes will be reduced, as it will no longer count the size of some headers populated by other middlewares.
downloader/response_bytes will be greatly increased if the website uses compression.
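As a rough back-of-the-envelope check using the local-run stats above (this only shows the reported average, not the actual uncompressed sizes):

# downloader/response_bytes divided by downloader/response_count, from the local run
avg_reported = 1108804016 / 42031
print(f"{avg_reported / 1024:.0f} KB average reported response size")   # ~26 KB
# If the site serves gzip, the decompressed bodies held in memory can be
# several times larger than this reported figure.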