I have a weird problem. I'm loading a huge file (3.5 GB), building a dictionary out of it, and doing some processing. After everything is finished, my script doesn't terminate immediately; it only exits after some time. I think it might be due to memory being freed. What else could cause this? I'd appreciate any opinion. Also, how can I make my script run faster?
Here's the corresponding code:
from collections import defaultdict
import codecs

class file_processor:
    def __init__(self):
        self.huge_file_dict = self.upload_huge_file()

    def upload_huge_file(self):
        d = defaultdict(list)
        # readlines() loads the whole 3.5 GB file into memory at once
        f = codecs.open('huge_file', 'r', encoding='utf-8').readlines()
        for line in f:
            l = line.strip()
            x, y, z, rb, t = l.split()
            d[rb].append((x, y, z, t))
        return d

    def do_some_processing(self, word):
        if word in self.huge_file_dict:
            # do something with self.huge_file_dict[word]
            pass
My guess is that your horrible slowdown, which doesn't recover until after your program has finished, is caused by using more memory than you actually have, which causes your OS to start swapping VM pages in and out to disk. Once enough swapping is happening, you end up in "swap hell": a large percentage of your memory accesses involve a disk read and even a disk write, which takes orders of magnitude longer, and your system won't recover until a few seconds after you finally free up all that memory.
The obvious solution is to not use so much memory.
tzaman's answer, avoiding readlines(), will eliminate some of that memory. A giant list of all the lines in a 3.5GB file has to take at least 3.5GB on Python 3.4 or 2.7 (but realistically at least 20% more than that), and maybe 2x or 4x on 3.0-3.3.
But the dict is going to be even bigger than the list, and you need that, right?
Well, no, you probably don't. Keeping the dict on-disk and fetching the values as-needed may sound slow, but it may still be a lot faster than keeping it in virtual memory, if that virtual memory has to keep swapping back and forth to disk.
You may want to consider using a simple dbm, or a more powerful key-value database (google "NoSQL key value" for some options), or a sqlite3 database, or even a server-based SQL database like MySQL.
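For instance, here's a rough sketch of the sqlite3 route, assuming the same five whitespace-separated columns as in your code (the table, column, and file names are just placeholders):

    import codecs
    import sqlite3

    def build_db(src='huge_file', db_path='huge_file.db'):
        # One row per line of the input file; the index on rb makes lookups fast.
        con = sqlite3.connect(db_path)
        con.execute('CREATE TABLE IF NOT EXISTS rows (x TEXT, y TEXT, z TEXT, rb TEXT, t TEXT)')
        with codecs.open(src, 'r', encoding='utf-8') as f:
            con.executemany('INSERT INTO rows VALUES (?, ?, ?, ?, ?)',
                            (line.split() for line in f))
        con.execute('CREATE INDEX IF NOT EXISTS idx_rb ON rows (rb)')
        con.commit()
        return con

    def lookup(con, word):
        # Fetches only the matching rows from disk instead of keeping a 3.5GB dict in RAM.
        return con.execute('SELECT x, y, z, t FROM rows WHERE rb = ?', (word,)).fetchall()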
Alternatively, if you can keep everything in memory, but in a more compact form, that's the best of both worlds.
I notice that in your example code, the only thing you're doing with the dict is checking word in self.huge_file_dict. If that's true, then you can use a set instead of a dict and not keep all those values around in memory. That should cut your memory use by about 80%.
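A minimal sketch of that, reusing the column layout from your upload_huge_file (the function and file names are just illustrative):

    import codecs

    def upload_huge_keys(path='huge_file'):
        # Keep only the rb column in a set; the (x, y, z, t) values are never stored.
        keys = set()
        with codecs.open(path, 'r', encoding='utf-8') as f:
            for line in f:
                x, y, z, rb, t = line.split()
                keys.add(rb)
        return keys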
If you frequently need the keys, but occasionally need the values, you might want to consider a dict that just maps the keys to indices into something you can read off disk as needed (e.g., a file with fixed-length strings, which you can then mmap and slice).
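Here's one way that might look, assuming each record fits into a hypothetical fixed width of RECORD_LEN bytes (the names and width are made up for illustration):

    import codecs
    import mmap

    RECORD_LEN = 64  # hypothetical fixed width; each padded record must fit in this many bytes

    def build_fixed_records(src='huge_file', dst='records.dat'):
        # Writes one fixed-length record per input line and returns a dict
        # mapping each rb key to the record numbers that belong to it.
        index = {}
        with codecs.open(src, 'r', encoding='utf-8') as fin, open(dst, 'wb') as fout:
            for i, line in enumerate(fin):
                x, y, z, rb, t = line.split()
                record = ' '.join((x, y, z, t)).encode('utf-8').ljust(RECORD_LEN)
                fout.write(record)
                index.setdefault(rb, []).append(i)
        return index

    def fetch(index, word, path='records.dat'):
        # mmap the data file and slice out just the records for this key.
        with open(path, 'rb') as f:
            m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            return [m[i * RECORD_LEN:(i + 1) * RECORD_LEN].decode('utf-8').split()
                    for i in index.get(word, [])]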
Or you could stick the values in a Pandas frame, which will be a little more compact than native Python storage—maybe enough to make the difference—and use a dict mapping keys to indices.
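A sketch of that Pandas variant, again assuming the same whitespace-separated columns (the groupby-based index is just one way to get the key-to-rows mapping):

    import pandas as pd

    # Load the five columns into a DataFrame; pandas stores them more compactly
    # than a dict of lists of Python tuples.
    df = pd.read_csv('huge_file', sep=r'\s+',
                     names=['x', 'y', 'z', 'rb', 't'])

    # Dict-like mapping from each rb key to the row labels that belong to it.
    index = df.groupby('rb').groups

    def do_some_processing(word):
        if word in index:
            rows = df.loc[index[word], ['x', 'y', 'z', 't']]
            return rows  # ... or do something with rows here, as needed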
Finally, you may be able to reduce the amount of swapping without actually reducing the amount of memory. Bisecting a giant sorted list, instead of accessing a giant dict, may—depending on the pattern of your words—give much better memory locality.
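For example, a sketch using the bisect module, assuming the file has been pre-sorted by the rb column beforehand (e.g. with something like sort -k4,4 huge_file > huge_file.sorted):

    import bisect
    import codecs

    # Two flat, parallel lists sorted by rb: the keys list is contiguous in
    # memory, so binary-searching it touches far fewer pages than a huge dict.
    keys, values = [], []
    with codecs.open('huge_file.sorted', 'r', encoding='utf-8') as f:
        for line in f:
            x, y, z, rb, t = line.split()
            keys.append(rb)
            values.append((x, y, z, t))

    def lookup(word):
        # Binary search for the first matching key, then scan the run of equal keys.
        i = bisect.bisect_left(keys, word)
        out = []
        while i < len(keys) and keys[i] == word:
            out.append(values[i])
            i += 1
        return out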
Don't call .readlines() -- that loads the entire file into memory beforehand. You can just iterate over f directly and it'll work fine.
with codecs.open('huge_file', 'r', encoding='utf-8') as f:
    for line in f:
        ...