I have a folder containing 7,603 files saved with pickle.dump. The average file size is 6.5 MB, so the files take up about 48 GB of disk space in total.
Each file is obtained by pickling a list object with the following structure:
[A * 50]
A = [str, int, [92 floats], B * 3]
B = [C * about 6]
C = [str, int, [92 floats]]
The computer I'm using has 128 GB of memory.
However, I cannot load all the files in the folder into memory with this script:
import pickle
import multiprocessing as mp
import sys
from os.path import join
from os import listdir
import os

def one_loader(the_arg):
    with open(the_arg, 'rb') as source:
        temp_fp = pickle.load(source)
    the_hash = the_arg.split('/')[-1]
    os.system('top -bn 1 | grep buff >> memory_log')
    return (the_hash, temp_fp)

def process_parallel(the_func, the_args):
    pool = mp.Pool(25)
    result = dict(pool.map(the_func, the_args))
    pool.close()
    return result

db_path = sys.argv[-1]  # path to the folder of pickled files
the_hashes = listdir(db_path)
the_files = [join(db_path, item) for item in the_hashes]
fp_dict = process_parallel(one_loader, the_files)
I have plotted the memory usage logged by the script:

[Plot of memory usage over time: usage climbs past 100 GB as files are loaded, then drops suddenly.]
I have two questions about this plot:
The first 4,000 files take 25 GB of disk space, so why do they take more than 100 GB of memory?
After the sudden drop in memory usage, I received no error, and I could see via top that the script was still running. But I have no idea what the system was doing, or where the rest of the memory went.
That is simply because serialized data takes less space than the in-memory representation Python needs to manage the object at runtime.
Example with a string:
import pickle

with open("foo", "wb") as f:
    pickle.dump("toto", f)
foo is 14 bytes on disk (including the pickle header and framing), but in memory the string is much bigger:
>>> import sys
>>> sys.getsizeof('toto')
53
For a dictionary it's even worse, because of the hash table (and other bookkeeping):
import pickle, os, sys

d = {"foo": "bar"}
with open("foo", "wb") as f:
    pickle.dump(d, f)
print(os.path.getsize("foo"))
print(sys.getsizeof(d))
result:
27
288
so roughly a 1:10 ratio.
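Also note that sys.getsizeof is shallow: for a container it counts only the container itself (e.g. a list's array of pointers), not the objects it holds. For deeply nested structures like yours, the gap between pickled size and memory footprint is therefore even larger than getsizeof suggests. Here is a rough sketch (deep_getsizeof is a hypothetical helper, not part of the standard library) applied to a list shaped like your C = [str, int, [92 floats]]:

```python
import pickle
import sys

def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over a nested structure.

    Handles only the container types used here (list/tuple/set/dict);
    a set of seen ids avoids double-counting shared objects.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, (list, tuple, set)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    elif isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    return size

# One element shaped like C = [str, int, [92 floats]]
# (distinct float objects, so nothing is deduplicated)
c = ["some_hash", 42, [i / 10 for i in range(92)]]

pickled = len(pickle.dumps(c))
in_memory = deep_getsizeof(c)
print(pickled, in_memory)
```

On CPython each float alone costs 24 bytes in memory versus about 9 bytes in a binary pickle, so the 92-float lists dominate both totals, and the in-memory figure comes out several times the pickled one. Exact numbers vary by Python version and pickle protocol.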