 

Python shared read memory

I'm working with a data set that is ~8 GB in size, and I'm using scikit-learn to train various ML models on it. The data set is basically a list of 1D vectors of ints.

How can I make the data set available to multiple Python processes, or how can I encode it so that it can be used with multiprocessing's classes? I've been reading about ctypes and the multiprocessing documentation, but I'm still quite confused. I only need the data to be readable by every process so I can train the models with it.

Do I need to have the shared multiprocessing variables as ctypes?

How can I represent the dataset as ctypes?

asked Aug 07 '16 by Georgi Georgiev

People also ask

What is shared memory in Python?

shared_memory — Shared memory for direct access across processes. New in version 3.8. This module provides a class, SharedMemory, for the allocation and management of shared memory to be accessed by one or more processes on a multicore or symmetric multiprocessor (SMP) machine.

Does Python multiprocessing use shared memory?

Python 3.8 introduced a new module, multiprocessing.shared_memory, that provides shared memory for direct access across processes. My test shows that it significantly reduces memory usage, which also speeds up the program by reducing the cost of copying and moving things around.
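
For illustration, here is a minimal sketch of SharedMemory combined with a numpy view (Python 3.8+; the small array and the names are stand-ins, not code from this page):

from multiprocessing import shared_memory
import numpy as np

# small stand-in array (the real dataset would be the 8 GB one)
data = np.arange(12, dtype=np.int64)

# allocate a shared block and copy the array into it
shm = shared_memory.SharedMemory(create=True, size=data.nbytes)
view = np.ndarray(data.shape, dtype=data.dtype, buffer=shm.buf)
view[:] = data[:]

# another process can attach to the same block by name:
#   existing = shared_memory.SharedMemory(name=shm.name)
#   arr = np.ndarray((12,), dtype=np.int64, buffer=existing.buf)

# drop the numpy view before closing, then free the block when done
del view
shm.close()
shm.unlink()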

How do you do multiprocessing in Python?

In this example, we first import the Process class, then create a Process object with the display() function as its target. The process is started with the start() method and finished with the join() method. We can also pass arguments to the function using the args keyword.
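
A minimal sketch of that pattern (the display() function here is just an illustration):

from multiprocessing import Process

def display(name):
    print(f"hello from {name}")

if __name__ == '__main__':
    p = Process(target=display, args=("worker",))  # args passes arguments to display()
    p.start()   # launch the child process
    p.join()    # wait for it to finish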


2 Answers

I am assuming you are able to load the whole dataset into RAM in a numpy array, and that you are working on Linux or a Mac. (If you are on Windows or you can't fit the array into RAM, then you should probably copy the array to a file on disk and use numpy.memmap to access it. Your computer will cache the data from disk into RAM as well as it can, and those caches will be shared between processes, so it's not a terrible solution.)
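
If you do end up on the memmap route, a minimal sketch might look like this (the file name, dtype and shape are placeholders, not values from the question):

import numpy as np

# dump the dataset to a file once
data = np.random.randint(0, 100, size=(1000, 50)).astype(np.int64)
fp = np.memmap('dataset.dat', dtype=np.int64, mode='w+', shape=(1000, 50))
fp[:] = data[:]
fp.flush()

# each process can then open the same file read-only; the OS page cache
# keeps the hot parts in RAM and shares them between processes
shared = np.memmap('dataset.dat', dtype=np.int64, mode='r', shape=(1000, 50))
print(shared[:5].mean())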

Under the assumptions above, if you need read-only access to the dataset in other processes created via multiprocessing, you can simply create the dataset and then launch the other processes. Because they are forked from the parent, they share its memory pages and can read the data from the original namespace without copying it. They can alter that data, but those changes won't be visible to the parent or to the other processes (the operating system copies each page of memory they alter into their own address space, i.e. copy-on-write).
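
Here is a minimal read-only sketch of that pattern (assuming the fork start method, the default on Linux; the array and the train() function are illustrative stand-ins):

import multiprocessing
import numpy as np

# stand-in for the real 8 GB dataset; create it BEFORE starting the workers
big_data = np.random.randint(0, 100, size=(1000, 50))

def train(start, stop):
    # the child inherits big_data via fork; reading it does not copy the pages
    print(big_data[start:stop].mean())

if __name__ == '__main__':
    workers = [multiprocessing.Process(target=train, args=(i * 250, (i + 1) * 250))
               for i in range(4)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()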

If your other processes need to alter the original dataset and make those changes visible to the parent process or other processes, you could use something like this:

import multiprocessing
import numpy as np

# create your big dataset
big_data = np.zeros((3, 3))

# create a shared-memory wrapper for big_data's underlying data
# (it doesn't matter what datatype we use, and 'c' is easiest)
# I think if lock=True, you get a serialized object, which you don't want.
# Note: you will need to setup your own method to synchronize access to big_data.
buf = multiprocessing.Array('c', big_data.data, lock=False)

# at this point, buf and big_data.data point to the same block of memory, 
# (try looking at id(buf[0]) and id(big_data.data[0])) but for some reason
# changes aren't propagated between them unless you do the following:
big_data.data = buf

# now you can update big_data from any process:
def add_one_direct():
    big_data[:] = big_data + 1

def add_one(a):
    # People say this won't work, since Process() will pickle the argument.
    # But in my experience Process() seems to pass the argument via shared
    # memory, so it works OK.
    a[:] = a+1

print "starting value:"
print big_data

p = multiprocessing.Process(target=add_one_direct)
p.start()
p.join()

print "after add_one_direct():"
print big_data

p = multiprocessing.Process(target=add_one, args=(big_data,))
p.start()
p.join()

print "after add_one():"
print big_data
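
The snippet above is written for Python 2, and assigning to big_data.data may not work with newer NumPy versions. Here is a rough Python 3 sketch of the same idea, allocating the shared buffer first and viewing it with np.frombuffer (my restatement, assuming the fork start method, not the answer's exact code):

import multiprocessing
import numpy as np

# allocate the shared buffer first, then view it as a numpy array
shape = (3, 3)
buf = multiprocessing.Array('d', int(np.prod(shape)), lock=False)
big_data = np.frombuffer(buf, dtype=np.float64).reshape(shape)

def add_one_direct():
    # the child inherits big_data via fork; the write lands in the shared
    # buffer, so the parent sees it after join()
    big_data[:] = big_data + 1

if __name__ == '__main__':
    print("starting value:")
    print(big_data)

    p = multiprocessing.Process(target=add_one_direct)
    p.start()
    p.join()

    print("after add_one_direct():")
    print(big_data)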
answered Oct 08 '22 by Matthias Fripp


This might be a duplicate of Share Large, Read-Only Numpy Array Between Multiprocessing Processes.

You could convert your dataset from its current representation to a new numpy memmap object and use it from every process. But it won't be very fast anyway: memmap only gives you the abstraction of working with an array in RAM, while in reality the data is a file on the HDD, partially cached in RAM. So you should prefer scikit-learn algorithms that have partial_fit methods, and use those.

https://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
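
For example, a sketch of incremental training over a memmapped dataset (SGDClassifier and the file names, dtypes and shapes are illustrative choices, not something from this answer):

import numpy as np
from sklearn.linear_model import SGDClassifier

# open the dataset stored on disk without loading it all into RAM
X = np.memmap('features.dat', dtype=np.float64, mode='r', shape=(1_000_000, 50))
y = np.memmap('labels.dat', dtype=np.int64, mode='r', shape=(1_000_000,))

clf = SGDClassifier()
classes = np.unique(y)  # partial_fit needs the full set of classes up front

# train in chunks so only a slice of the memmap is resident at a time
batch = 10_000
for start in range(0, X.shape[0], batch):
    clf.partial_fit(X[start:start + batch], y[start:start + batch], classes=classes)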

Actually, joblib (which scikit-learn uses for parallelization) automatically converts your dataset to a memmap representation so it can be used from different processes (if it's big enough, of course).
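
A hedged sketch of that behaviour (the max_nbytes threshold and the toy task are illustrative; joblib dumps inputs larger than the threshold to a memmap shared by the workers):

import numpy as np
from joblib import Parallel, delayed

big_data = np.random.rand(100_000, 50)  # stand-in for the real dataset

def column_mean(arr, i):
    return arr[:, i].mean()

if __name__ == '__main__':
    # inputs larger than max_nbytes are dumped to a memmap and shared between
    # the worker processes instead of being copied into each of them
    results = Parallel(n_jobs=4, max_nbytes='1M')(
        delayed(column_mean)(big_data, i) for i in range(big_data.shape[1])
    )
    print(results[:3])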

answered Oct 08 '22 by Ibraim Ganiev