Memory Leak in Scrapy

I wrote the following code to scrape for email addresses (for testing purposes):

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.selector import Selector
from crawler.items import EmailItem

class LinkExtractorSpider(CrawlSpider):
    name = 'emailextractor'
    start_urls = ['http://news.google.com']

    # follow every extracted link and run process_item on each response
    rules = (Rule(LinkExtractor(), callback='process_item', follow=True),)

    def process_item(self, response):
        refer = response.url
        items = list()
        # collect every string that looks like an email address
        for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
            emailitem = EmailItem()
            emailitem['email'] = email
            emailitem['refer'] = refer
            items.append(emailitem)
        return items

Unfortunately, it seems that references to the Requests are not released properly: according to the Scrapy telnet console, the number of live Requests increases by about 5k/s. After ~3 min and 10k scraped pages, my system starts swapping (8 GB RAM). Does anyone have an idea what is wrong? I already tried removing the refer field and "copying" the string using

emailitem['email'] = ''.join(email)

without success. After scraping, the items are saved into a BerkeleyDB that counts their occurrences (via an item pipeline), so the references should be gone after that.
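
For context, the pipeline does something along these lines (a simplified sketch, not my exact code; bsddb3 and the emails.db filename are just placeholders):

# simplified sketch of the counting pipeline (bsddb3 and 'emails.db' are placeholders)
import bsddb3

class EmailCountPipeline(object):

    def open_spider(self, spider):
        # 'c' opens the hash database for read/write and creates it if missing
        self.db = bsddb3.hashopen('emails.db', 'c')

    def process_item(self, item, spider):
        # increment the per-address counter
        key = item['email'].encode('utf-8')
        count = int(self.db[key]) + 1 if key in self.db else 1
        self.db[key] = str(count).encode('utf-8')
        return item

    def close_spider(self, spider):
        self.db.close()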

What would be the difference between returning a list of items and yielding each item separately?

EDIT:

After quite a while of debugging, I found out that the Requests are not freed, so I end up with:

$> nc localhost 6023
>>> prefs()
Live References
Request 10344   oldest: 536s ago
>>> from scrapy.utils.trackref import get_oldest
>>> r = get_oldest('Request')
>>> r.url
<GET http://news.google.com>

which is in fact the start URL. Does anybody know what the problem is? Where is the reference that keeps the Request object alive?

EDIT2:

After running for ~12 hours on a server with 64 GB RAM, the memory used is ~16 GB (measured with ps, even though ps is not the right tool for this). The problem is that the crawl rate is dropping significantly and the number of scraped items has been 0 for hours:

INFO: Crawled 122902 pages (at 82 pages/min), scraped 3354 items (at 0 items/min)

EDIT3: I did the objgraph analysis, which results in the following backreference graph (thanks @Artur Gaspar): [image: Python objgraph backlink]

It does not seem like something I can influence from my own code, or can I?
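
For anyone who wants to reproduce this: a backref graph like the one above can be generated from the telnet console roughly like this (a sketch; max_depth and the output filename are arbitrary choices):

# inside the Scrapy telnet console (nc localhost 6023)
from scrapy.utils.trackref import get_oldest
import objgraph

r = get_oldest('Request')  # oldest live Request, the start URL in my case
# render the chain of objects that still hold a reference to it
objgraph.show_backrefs([r], max_depth=10, filename='request_backrefs.png')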

asked by Robin on May 25 '15


3 Answers

The final answer for me was to use disk-based queues in conjunction with a working directory passed as a runtime parameter.

This means adding the following to settings.py:

# crawl breadth-first and keep the pending request queue on disk instead of in memory
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

Afterwards, starting the crawler with the following command line makes the crawl state persistent in the given directory:

scrapy crawl {spidername} -s JOBDIR=crawls/{spidername}

(see the Scrapy docs for details)

The additional benefit of this approach is that the crawl can be paused and resumed at any time. My spider has now been running for more than 11 days, using ~15 GB of memory (file cache memory for the disk FIFO queues).

answered by Robin


If you yield each item separately, the code is executed differently by the Python interpreter: it's no longer an ordinary function, but a generator.

This way, the full list is never created, and each item has its memory allocated one at a time, when the code consuming the generator asks for the next item.
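
Applied to the spider from the question, the yield version would look roughly like this (a sketch reusing the names from the question):

def process_item(self, response):
    refer = response.url
    # yield each item as soon as it is built instead of collecting them in a list
    for email in Selector(response).re(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,4}"):
        emailitem = EmailItem()
        emailitem['email'] = email
        emailitem['refer'] = refer
        yield emailitem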

So it could be that you don't have a memory leak; you just have a lot of memory being allocated, roughly 10k pages times the memory used by the item list for one page.

Of course, you could still have a real memory leak; there are tips for debugging leaks in Scrapy in the docs ("Debugging memory leaks").

answered by Elias Dorneles


I want to point out an update to Robin's answer (I can't comment on his post yet, low rep).

Make sure you use the new module path for the queues, because the old one is deprecated now. That missing "s" cost me some days of figuring out what was wrong. The new syntax is this:

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'

answered by Andoni Aranguren