Is it possible to store Python objects (specifically sklearn models) in memory-mapped files?

I have several large objects (sklearn models) that take up a lot of memory, and I want to share them between several processes. Is there a way to do this?

  • It has to be the "live" object, and not a serialized version
  • I know that there's a memory-mapped version of numpy arrays, which account for a significant part of the model's memory (see the sketch after this list) - but using them would require significant changes to the sklearn source code, which would be hard to maintain
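
For context, this is what a memory-mapped numpy array looks like; a minimal sketch (the file name and shape are illustrative), not a way to share a whole sklearn model:

import numpy as np

# Create an array backed by a file on disk (file name is illustrative).
arr = np.memmap("weights.dat", dtype="float64", mode="w+", shape=(1000, 100))
arr[:] = 1.0       # writes go through the OS page cache to the file
arr.flush()

# Another process can map the same file and read the same data
# without holding its own private copy in memory.
shared = np.memmap("weights.dat", dtype="float64", mode="r", shape=(1000, 100))
print(shared[0, :5])
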
asked Feb 23 '16 by Ophir Yoktan

People also ask

Can you save a Sklearn model?

Yes. You can serialize a trained model with the pickle module and write the serialized bytes to a file, then load the model back later.
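
A minimal sketch of that pattern (the model and file name are illustrative):

import pickle
from sklearn.linear_model import LogisticRegression

# Train a small illustrative model.
model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])

# Serialize the trained model to a file.
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)

# Load it back later, in this or another process.
with open("model.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored.predict([[0.5]]))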

What is memory mapping Python?

Memory mapping is an alternative approach to file I/O that's available to Python programs through the mmap module. It uses lower-level operating system APIs to map file contents directly into a process's address space, so the file can be read and written like an in-memory byte array.
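
A minimal sketch with the standard-library mmap module (the file name and contents are illustrative):

import mmap

# Create a small file to map.
with open("data.bin", "wb") as f:
    f.write(b"hello mmap")

with open("data.bin", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # map the whole file
    print(mm[:5])                   # read like a byte array -> b'hello'
    mm[:5] = b"HELLO"               # writes go back to the file
    mm.flush()
    mm.close()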

Why is the Sklearn library used, and what are the different classes in it?

Scikit-learn (Sklearn) is one of the most widely used and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface.
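
That consistent interface means very different estimators share the same entry points; a minimal sketch (the data is illustrative):

from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

X, y = [[0.0], [1.0], [2.0], [3.0]], [0.0, 1.0, 2.0, 3.0]

# A regression estimator and a clustering estimator both expose fit().
reg = LinearRegression().fit(X, y)
print(reg.predict([[4.0]]))        # -> [4.]

km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)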


1 Answer

Under the proviso that the processes are launched from the same Python script, here is an example that creates a second process and shares variables between the two. It is straightforward to elaborate on this to create any number of processes. Note the constructs used to create and access the shared variables and the lock. I have inserted an arithmetic loop to generate some CPU usage, so that you can monitor it and see how it runs on a multi-core or multi-processor platform. Also note the use of a shared variable to control the second process, in this instance to tell it when to exit. Finally, the shared object can be a value or an array; see https://docs.python.org/2/library/multiprocessing.html

#!/usr/bin/python

from time import sleep
from multiprocessing import Process, Value, Lock

def myfunc(counter, lock, run):
    # Loop until the main process clears the shared 'run' flag.
    while run.value:
        sleep(1)
        # Arithmetic loop to generate some CPU load.
        n = 0
        for i in range(10000):
            n = n + i * i
        print(n)
        # Take the lock before updating the shared counter.
        with lock:
            counter.value += 1
            print("thread %d" % counter.value)

    # Signal to the main process that the worker has exited.
    with lock:
        counter.value = -1
        print("thread exit %d" % counter.value)

# =======================

# The main guard keeps platforms that spawn (rather than fork)
# child processes from re-executing this block on import.
if __name__ == "__main__":
    counter = Value('i', 0)    # shared integer counter
    run = Value('b', True)     # shared boolean flag controlling the worker
    lock = Lock()

    p = Process(target=myfunc, args=(counter, lock, run))
    p.start()

    # Do some work here too, while the worker counts to 5.
    while counter.value < 5:
        print("main %d" % counter.value)
        n = 0
        for i in range(10000):
            n = n + i * i
        print(n)
        sleep(1)

    # Reset the shared counter and let the worker count to 5 again.
    with lock:
        counter.value = 0

    while counter.value < 5:
        print("main %d" % counter.value)
        sleep(1)

    # Tell the worker to exit, then wait for it.
    run.value = False
    p.join()

    print("main exit %d" % counter.value)
answered Nov 14 '22 by DrM