Why is this Python method leaking memory?

This method iterates over a list of terms from the database, checks whether each term appears in the text passed as an argument, and if it does, replaces it with a link to the search page with the term as a parameter.

The number of terms is high (about 100,000), so the process is pretty slow, but that's OK since it is performed as a cron job. However, it causes the script's memory consumption to skyrocket and I can't find out why:

class SearchedTerm(models.Model):

[...]

@classmethod
def add_search_links_to_text(cls, string, count=3, queryset=None):
    """
        Take a list of all searched terms and search for them in the
        text. If they exist, turn them into links to the search
        page.

        This process is limited to `count` replacements maximum.

        WARNING: because the sites have different URL schemas, we don't
        provide direct links, but inject the {% url %} tag
        so it must be rendered before display. You can use the `eval`
        tag from `libs` for this. Since they have different namespaces
        as well, we insert a generic 'namespace' and delegate to the
        template to replace it with the proper one.

        If you have a batch process to do, you can pass a queryset
        that will be used instead of fetching all searched terms on
        each call.
    """

    found = 0

    terms = queryset or cls.on_site.all()

    # to avoid duplicate searched terms being replaced twice,
    # keep a set of already linkified content;
    # also seed it with the words we are going to insert with the link,
    # so they won't match in case of multiple passes
    processed = set((u'video', u'streaming', u'title',
                     u'search', u'namespace', u'href',
                     u'url'))

    for term in terms:

        text = term.text.lower()

        # skip small words, and make a
        # quick check to avoid the rest of the matching
        if len(text) < 3 or text not in string:
            continue

        if found and cls._is_processed(text, processed):
            continue

        # match the search word with accent, for any case
        # ensure this is not part of a word by including 
        # two 'non-letter' character on both ends of the word
        # escape the term in case it contains regex metacharacters
        pattern = re.compile(ur'([^\w]|^)(%s)([^\w]|$)' % re.escape(text),
                            re.UNICODE|re.IGNORECASE)

        if re.search(pattern, string):
            found += 1

            # create the link string
            # replace the word in the description 
            # use back references (\1, \2, etc) to preserve the original
            # formatting
            # use raw unicode strings (ur"string" notation) to avoid
            # problems with accents and escaping

            query = '-'.join(term.text.split())
            url = ur'{%% url namespace:static-search "%s" %%}' % query
            replace_with = ur'\1<a title="\2 video streaming" href="%s">\2</a>\3' % url

            string = re.sub(pattern, replace_with, string)

            processed.add(text)

            if found >= count:
                break

    return string

You'll probably want this code as well:

class SearchedTerm(models.Model):

[...]

@classmethod
def _is_processed(cls, text, processed):
    """
        Check if the text is part of an already processed string.
        We don't just use `in` on the set, but `in` against each
        string of the set, to catch substring matches that would
        destroy the tags.

        This is mainly a utility function, so you probably won't use
        it directly.
    """
    if text in processed:
        return True

    return any(((text in string) for string in processed))

I really have only two objects holding references that could be the suspects here: terms and processed. But I can't see any reason for them not to be garbage collected.

EDIT:

I think I should say that this method is called inside a Django model method itself. I don't know if it's relevant, but here is the code:

class Video(models.Model):

[...]

def update_html_description(self, links=3, queryset=None):
    """
        Take a list of all searched terms and search for them in the
        description. If they exist, turn them into links to the search
        engine. Put the result into `html_description`.

        This uses `add_search_links_to_text` and therefore has the same
        limitations.

        It DOESN'T call save().
    """
    queryset = queryset or SearchedTerm.objects.filter(sites__in=self.sites.all())
    text = self.description or self.title
    self.html_description = SearchedTerm.add_search_links_to_text(text, 
                                                                  links, 
                                                                  queryset)

I can imagine that the automatic Python regex caching eats up some memory. But it should do that only once, and the memory consumption goes up at every call of update_html_description.
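For what it's worth, `re`'s automatic pattern cache shouldn't grow without bound: it is capped (at `re._MAXCACHE` entries in CPython, 100 in 2.7) and flushed when it fills up, and you can clear it yourself with `re.purge()`. A quick sketch (the `search_many` helper is mine, just to illustrate):

```python
import re

def search_many(terms, text):
    # compile one pattern per term, as add_search_links_to_text does;
    # each compile goes through re's bounded internal cache
    hits = [t for t in terms if re.search(r'\b%s\b' % re.escape(t), text)]
    re.purge()  # explicitly drop the compiled-pattern cache
    return hits
```

So if memory climbs on every call, the pattern cache is an unlikely suspect.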

The problem is not just that it consumes a lot of memory; the problem is that it does not release it: each call takes about 3% of the RAM, eventually filling it up and crashing the script with 'cannot allocate memory'.

asked Jul 18 '11 by e-satis


2 Answers

The whole queryset is loaded into memory once you call it; that is what eats up your memory. You want to fetch results in chunks if the result set is that large. It may mean more hits on the database, but it will also mean a lot less memory consumption.
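In Django terms, that means not looping over the queryset directly (iterating it fills the queryset's result cache with every row) but using `QuerySet.iterator()`, which streams rows without caching them. A generic chunking helper in the same spirit might look like this (the name and chunk size are mine):

```python
from itertools import islice

def in_chunks(iterable, size=1000):
    """Yield lists of at most `size` items, so only one chunk is
    resident in memory at a time (provided the source is lazy, e.g.
    a generator or a Django QuerySet.iterator())."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# e.g.: for chunk in in_chunks(SearchedTerm.on_site.all().iterator()): ...
```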

answered Nov 09 '22 by Giltech


I was completely unable to find the cause of the problem, but for now I'm bypassing it by isolating the infamous snippet: I call a script (using subprocess) that contains this method call. The memory goes up, but of course goes back to normal after the Python process dies.

Talk about dirty.

But that's all I got for now.
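The workaround can be sketched generically: spawn a fresh interpreter for the leaky step, and let the OS reclaim everything when the child exits (a minimal version; the real script would of course set up Django and call update_html_description):

```python
import subprocess
import sys

def run_isolated(code):
    """Run a Python snippet in a separate interpreter process; any
    memory it allocates is freed by the OS when the child exits."""
    return subprocess.check_output([sys.executable, "-c", code])
```

For example, `run_isolated("print('done')")` returns the child's stdout while leaving the parent's memory untouched.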

answered Nov 09 '22 by e-satis