I have been working on this for a few weeks now and I've read many questions about Python memory leaks, but I just can't figure it out.
I have a file that contains about 7 million lines. For each line, I need to create a dictionary, so the result is a list of dictionaries that looks like:
[{'a': 2, 'b': 1}, {'a': 1, 'b': 2, 'c': 1}]
What I am doing is...
# "file" is an already-open file object for the 7-million-line input
list = []
for line in file.readlines():
    terms = line.split(" ")
    dict = {}
    for term in terms:
        if term in dict:
            dict[term] = dict[term] + 1
        else:
            dict[term] = 1
    list.append(dict.copy())
    dict.clear()
file.close()
The problem is that when I run this, it always gets killed around the 6,000,000th line. Originally I was just doing dict = {}, but I changed it to dict.clear() after reading similar posts; it didn't improve anything. I know some posts mentioned circular references, and I looked into my code, but I don't think I have that problem.
Is storing 7 million dictionaries in a list really more than Python can handle? I would appreciate any advice on how I can run the whole thing without getting killed.
(The Python version is 2.7.4.)
Try:
from collections import Counter
with open('input') as fin:
    term_counts = [Counter(line.split()) for line in fin]
I believe this is what you're trying to achieve with your code.
This avoids .readlines() loading the whole file into memory first, utilises Counter to do the counting, and builds the list in one go without faffing around blanking, assigning, clearing dictionaries and appending to lists...
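If even the resulting list of ~7 million Counter objects turns out to be too big for your RAM, you could go one step further and produce the counts lazily with a generator, so only one line's counts exist at a time. This is just a rough sketch, not part of the answer above; the filename 'input' and the aggregation step are placeholders:

from collections import Counter

def iter_term_counts(path):
    # Yield one Counter per line; nothing is retained once the caller
    # moves on to the next line.
    with open(path) as fin:
        for line in fin:
            yield Counter(line.split())

# Example use: fold all per-line counts into one overall tally
# without ever materialising 7 million dictionaries at once.
total = Counter()
for counts in iter_term_counts('input'):  # 'input' is a placeholder path
    total.update(counts)

Of course, if you genuinely need random access to all 7 million per-line dictionaries, the list comprehension above is the right shape and the question becomes purely one of available memory.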