Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Preserving numpy view when pickling

By default, pickling a numpy view array loses the view relationship, even if the array base is pickled too. My situation is that I have some complex container objects which are pickled. And in some cases, some contained data are views in some others. Saving a independent array of each view is not only a loss of space but also, the reloaded data have lost the view relationship.

A simple example would be (but in my case the container are more complex than a dictionary):

import numpy as np
import cPickle

tmp = np.zeros(2)
d1 = dict(a=tmp,b=tmp[:])    # d1 to be saved: b is a view on a

pickled = cPickle.dumps(d1)
d2 = cPickle.loads(pickled)  # d2 reloaded copy of d1 container

print 'd1 before:', d1
d1['b'][:] = 1
print 'd1 after: ', d1

print 'd2 before:', d2
d2['b'][:] = 1
print 'd2 after: ', d2

which would print:

d1 before: {'a': array([ 0.,  0.]), 'b': array([ 0.,  0.])}
d1 after:  {'a': array([ 1.,  1.]), 'b': array([ 1.,  1.])}
d2 before: {'a': array([ 0.,  0.]), 'b': array([ 0.,  0.])}
d2 after:  {'a': array([ 0.,  0.]), 'b': array([ 1.,  1.])} # not a view anymore

My question:

(1) Is there a way to preserve it? (2) (even better) is there a way to do it only if the base is pickled

For the (1) I think there may be some way by changing the __setstate__, __reduce_ex_, etc... of the view array. But I don't fill confident with these for now. For the (2) I have no idea.

like image 573
Juh_ Avatar asked Dec 06 '12 15:12

Juh_


People also ask

Can Pickle save NumPy array?

Saving and Loading NumPy ArraysAlthough these arrays can be pickled, it is strongly advised to leverage NumPy's file-IO functions. NumPy's standard binary file type used to store array data is known as an '. npy' file.

How to unpickle data Python?

As we said earlier, the load() method can be used to unpickle the pickled Python object. You have to first open the pickled file using rb (read-binary) permission and pass the opened file to the load() method, as shown below. The load() method unpickles the data and returns the actual object.

What is picklable in Python?

Pickle in Python is primarily used in serializing and deserializing a Python object structure. In other words, it's the process of converting a Python object into a byte stream to store it in a file/database, maintain program state across sessions, or transport data over the network.

Which of the following is correct NumPy code to save array data on disk NPY )?

save() numpy. save() function is used to store the input array in a disk file with npy extension(. npy).


1 Answers

This isn't done in NumPy proper, because it doesn't always make sense to pickle the base array, and pickle does not expose the ability to check if another object is also being pickled as part of its API.

But this sort of check can be done in a custom container for NumPy arrays. For example:

import numpy as np
import pickle

def byte_offset(array, source):
    return array.__array_interface__['data'][0] - np.byte_bounds(source)[0]

class SharedPickleList(object):
    def __init__(self, arrays):
        self.arrays = list(arrays)

    def __getstate__(self):
        unique_ids = {id(array) for array in self.arrays}
        source_arrays = {}
        view_tuples = {}
        for array in self.arrays:
            if array.base is None or id(array.base) not in unique_ids:
                # only use views if the base is also being pickled
                source_arrays[id(array)] = array
            else:
                view_tuples[id(array)] = (array.shape,
                                          array.dtype,
                                          id(array.base),
                                          byte_offset(array, array.base),
                                          array.strides)
        order = [id(array) for array in self.arrays]
        return (source_arrays, view_tuples, order)

    def __setstate__(self, state):
        source_arrays, view_tuples, order = state
        view_arrays = {}
        for k, view_state in view_tuples.items():
            (shape, dtype, source_id, offset, strides) = view_state
            buffer = source_arrays[source_id].data
            array = np.ndarray(shape, dtype, buffer, offset, strides)
            view_arrays[k] = array
        self.arrays = [source_arrays[i]
                       if i in source_arrays
                       else view_arrays[i]
                       for i in order]

# unit tests
def check_roundtrip(arrays):
    unpickled_arrays = pickle.loads(pickle.dumps(
        SharedPickleList(arrays))).arrays
    assert all(a.shape == b.shape and (a == b).all()
               for a, b in zip(arrays, unpickled_arrays))

indexers = [0, None, slice(None), slice(2), slice(None, -1),
            slice(None, None, -1), slice(None, 6, 2)]

source0 = np.random.randint(100, size=10)
arrays0 = [np.asarray(source0[k1]) for k1 in indexers]
check_roundtrip([source0] + arrays0)

source1 = np.random.randint(100, size=(8, 10))
arrays1 = [np.asarray(source1[k1, k2]) for k1 in indexers for k2 in indexers]
check_roundtrip([source1] + arrays1)

This results in significant space savings:

source = np.random.rand(1000)
arrays = [source] + [source[n:] for n in range(99)]
print(len(pickle.dumps(arrays, protocol=-1)))
# 766372
print(len(pickle.dumps(SharedPickleList(arrays), protocol=-1)))
# 11833
like image 189
shoyer Avatar answered Sep 20 '22 12:09

shoyer