Python: Pre-loading memory

I have a python program where I need to load and de-serialize a 1GB pickle file. It takes a good 20 seconds and I would like to have a mechanism whereby the content of the pickle is readily available for use. I've looked at shared_memory but all the examples of its use seem to involve numpy and my project doesn't use numpy. What is the easiest and cleanest way to achieve this using shared_memory or otherwise?

This is how I'm loading the data now (on every run):

def load_pickle(pickle_name):
    return pickle.load(open(DATA_ROOT + pickle_name, 'rb'))

I would like to be able to edit the simulation code between runs without having to reload the pickle. I've been messing around with importlib.reload, but it really doesn't seem to work well for a large Python program with many files:

def main():
    data_manager.load_data()
    run_simulation()
    while True:
        try:
            importlib.reload(simulation)
            run_simulation()
        except Exception:
            print(traceback.format_exc())
        print('Press enter to re-run main.py, CTRL-C to exit')
        sys.stdin.readline()
asked Jun 08 '21 by etayluz




3 Answers

Adding another assumption-challenging answer: where you're reading your files from can make a big difference.

1 GB is not a great amount of data on today's systems; at 20 seconds to load, that's only 50 MB/s, which is a fraction of what even the slowest disks provide.

You may find that a slow disk or some kind of network share is your real bottleneck, and that changing to a faster storage medium or compressing the data (perhaps with gzip) makes a great difference to reading and writing.
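For example, a minimal sketch of gzip-compressed pickling (an assumption, not the asker's setup: DATA_ROOT and the file names are placeholders carried over from the question, and compresslevel=1 favors speed over compression ratio):

import gzip
import pickle

DATA_ROOT = '/path/to/data/'  # placeholder path

def save_pickle_gz(obj, pickle_name):
    # A smaller file reads faster from a slow disk, at some CPU cost.
    with gzip.open(DATA_ROOT + pickle_name + '.gz', 'wb', compresslevel=1) as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load_pickle_gz(pickle_name):
    with gzip.open(DATA_ROOT + pickle_name + '.gz', 'rb') as f:
        return pickle.load(f)

Whether this helps depends on how compressible the data is and on the disk: on a fast NVMe drive the decompression can cost more time than it saves.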

answered Oct 19 '22 by ti7


An alternative to storing the unpickled data in memory would be to store the pickle in a ramdisk, so long as most of the time overhead comes from disk reads. Example code (to run in a terminal) is below.

sudo mkdir /mnt/pickle
sudo mount -o size=1536M -t tmpfs none /mnt/pickle
cp path/to/pickle.pkl /mnt/pickle/pickle.pkl

Then you can access the pickle at /mnt/pickle/pickle.pkl. Note that you can change the file names and extensions to whatever you want. If disk reads are not the biggest bottleneck, you might not see a speed increase. If you run out of memory, you can try turning down the size of the ramdisk (I set it at 1536 MB, or 1.5 GB).
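As a usage sketch (assuming the paths from the commands above), the question's loader only needs to point at the tmpfs mount:

import pickle

# Hypothetical path: wherever the pickle was copied on the tmpfs mount.
RAMDISK_PATH = '/mnt/pickle/pickle.pkl'

def load_pickle_from_ramdisk():
    # The read now comes from RAM-backed tmpfs instead of the physical disk;
    # the unpickling (CPU) cost is unchanged.
    with open(RAMDISK_PATH, 'rb') as f:
        return pickle.load(f)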

answered Oct 19 '22 by thshea


You can use a shareable list: one Python program runs, loads the file, and keeps it in memory, and another Python program takes the data from memory. Whatever your data is, you can load it into a dictionary, dump it as JSON, and then reload the JSON. So:

Program1

import pickle
import json
from multiprocessing.managers import SharedMemoryManager

YOUR_DATA = pickle.load(open(DATA_ROOT + pickle_name, 'rb'))
data_dict = {'DATA': YOUR_DATA}
data_dict_json = json.dumps(data_dict)

smm = SharedMemoryManager()
smm.start()
sl = smm.ShareableList(['alpha', 'beta', data_dict_json])
print(sl)  # the printed name is what Program2 attaches to
# smm.shutdown()  # commenting out shutdown for now, but you will need to call it eventually

The output will look like this

ShareableList(['alpha', 'beta', "your data in json format"], name='psm_12abcd')

Now in Program2:

import json
from multiprocessing import shared_memory

load_from_mem = shared_memory.ShareableList(name='psm_12abcd')  # the name printed by Program1
load_from_mem[1]
#OUTPUT
'beta'
json.loads(load_from_mem[2])
#OUTPUT
# your data back in dictionary format


You can find more here: https://docs.python.org/3/library/multiprocessing.shared_memory.html
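Since the question asks about shared_memory without numpy, a variant of the same idea is to share the raw pickle bytes directly. This is a minimal sketch, not the answerer's code: the segment name 'sim_data' and YOUR_DATA are illustrative, and it sidesteps ShareableList's roughly 10 MB per-element limit on str/bytes values, which matters for a 1 GB payload:

# Program1 (producer): keep this process alive while consumers attach.
import pickle
from multiprocessing import shared_memory

payload = pickle.dumps(YOUR_DATA, protocol=pickle.HIGHEST_PROTOCOL)
shm = shared_memory.SharedMemory(name='sim_data', create=True, size=len(payload))
shm.buf[:len(payload)] = payload  # copy the pickle bytes into the shared segment
# ... later, when done: shm.close(); shm.unlink() to release the segment

# Program2 (consumer):
import pickle
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(name='sim_data')
# pickle stops at its STOP opcode, so any zero padding at the end of the
# segment (the OS may round the size up) is ignored.
data = pickle.loads(bytes(shm.buf))
shm.close()

Note that this avoids the disk read but not the unpickling itself, so if most of the 20 seconds is deserialization rather than I/O, the gain will be modest.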

answered Oct 19 '22 by ibadia