
Python gets Killed (probably memory leak)

Tags:

python

I have been working on this for a few weeks now and I've read many questions about Python memory leaks, but I just can't figure it out.

I have a file that contains about 7 million lines. For each line, I need to create a dictionary of term counts, so I end up with a list of dictionaries that looks like:

[{'a': 2, 'b': 1}, {'a': 1, 'b': 2, 'c': 1}]

What I am doing is...

file = open("terms.txt")  # the input file (~7 million lines; name here is a placeholder)
list = []
for line in file.readlines():
    terms = line.split(" ")
    dict = {}
    # count how many times each term appears on this line
    for term in terms:
        if term in dict:
            dict[term] = dict[term] + 1
        else:
            dict[term] = 1
    list.append(dict.copy())
    dict.clear()
file.close()

The problem is that when I run this, it always gets killed around the 6,000,000th line. Originally I was just doing dict = {}, but I changed it to dict.clear() after reading similar posts; it didn't improve anything. I know some posts mentioned circular references, and I looked over my code, but I don't think I have that problem.

Is it really the case that storing 7 million dictionaries in a list can't be handled in Python? I would appreciate any advice on how I can run the whole thing without getting killed.

(The Python version is 2.7.4.)
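
For a rough sense of scale, here is a back-of-envelope check (a sketch; the 280-byte figure assumes a 64-bit CPython 2.7 build, and sizes vary between builds):

import sys

# sys.getsizeof reports only the dict structure itself, not the
# keys and values it references, so this is a lower bound
d = {'a': 2, 'b': 1}
print sys.getsizeof(d)  # ~280 bytes on 64-bit CPython 2.7

# 7 million such dicts: 7e6 * 280 bytes is roughly 1.8 GB,
# before counting the term strings and integers themselves
print 7 * 10**6 * 280 / float(2**30), "GB"

On top of that, readlines() holds the entire file in memory as a list of strings at the same time, so the process can easily exceed the RAM of a modest machine.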

Asked Jul 20 '13 by kabichan

1 Answer

Try:

from collections import Counter

# iterate over the file lazily, building one Counter of
# term frequencies per line
with open('input') as fin:
    term_counts = [Counter(line.split()) for line in fin]

I believe this is what you're trying to achieve with your code.

This avoids .readlines() loading the entire file into memory first, utilises Counter to do the counting, and builds the list in one go without the faff of blanking, assigning, clearing and copying dictionaries and appending to lists. Note that split() with no argument also splits on any whitespace and discards the trailing newline, which split(" ") would otherwise leave attached to the last term of every line.
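
For illustration, a quick interactive check of what each element looks like (output from CPython 2.7; the ordering of equal counts in the repr can vary, and Counter is a dict subclass, so the original term-lookup code keeps working unchanged):

>>> from collections import Counter
>>> line = "a b b c\n"
>>> Counter(line.split())
Counter({'b': 2, 'a': 1, 'c': 1})
>>> Counter(line.split())['b']
2

If even the finished list of 7 million Counters is too large for the machine, the same comprehension can be written as a generator expression, (Counter(line.split()) for line in fin), so that each line's counts are produced and consumed one at a time instead of being stored.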

Answered Oct 11 '22 by Jon Clements