Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How does np.ndarray.tobytes() work for dtype "object"?

Tags:

python

numpy

I encountered a strange behavior of np.ndarray.tobytes() that makes me doubt that it is working deterministically, at least for arrays of dtype=object.

import numpy as np
print(np.array([1,[2]]).dtype)
# => object
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00H{!-\x01\x00\x00\x00'
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00\x88\x9d)-\x01\x00\x00\x00'

In the sample code, a list of mixed python objects ([1, [2]]) is first converted to a numpy array, and then transformed to a byte sequence using tobytes().

Why do the resulting byte-representations differ for repeated instantiations of the same data? The documentation just states that it converts an ndarray to raw python bytes, but it does not refer to any limitations. So far, I observed this just for dtype=object. Numeric arrays always yield the same byte sequence:

np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh@l\xee?Qg\x1e\x8f~l\xe7?'
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh@l\xee?Qg\x1e\x8f~l\xe7?'

Have I missed an elementar thing about python's/numpy's memory architecture? I tested with numpy version 1.17.2 on a Mac.


Context: I encountered this problem when trying to compute a hash for arbitrary data structures. I hoped that I can rely on the basic serialization capabilities of tobytes(), but this appears to be a wrong premise. I know that pickle is the standard for serialization in python, but since I don't require portability and my data structures only contain numbers, I first sought help with numpy.

like image 541
normanius Avatar asked Mar 03 '20 15:03

normanius


People also ask

How does NumPy Tobytes work?

Construct Python bytes containing the raw data bytes in the array. Constructs Python bytes showing a copy of the raw contents of data memory. The bytes object can be produced in either 'C' or 'Fortran', or 'Any' order (the default is 'C'-order).

Can NumPy Ndarray hold any type of object?

NumPy arrays are typed arrays of fixed size. Python lists are heterogeneous and thus elements of a list may contain any object type, while NumPy arrays are homogenous and can contain object of only one type.

What is Ndarray Dtype?

Every ndarray has an associated data type (dtype) object. This data type object (dtype) informs us about the layout of the array. This means it gives us information about: Type of the data (integer, float, Python object, etc.) Size of the data (number of bytes)


1 Answers

An array of dtype object stores pointers to the objects it contains. In CPython, this corresponds to the id. Every time you create a new list, it will be allocated at a new memory address. However, small integers are interned, so 1 will reference the same integer object every time.

You can see exactly how this works by checking the IDs of some sample objects:

>>> x = np.array([1, [2]])
>>> x.tobytes()
b'\x90\x91\x04a\xfb\x7f\x00\x00\xc8[J\xaa+\x02\x00\x00'
>>> id(x[0])
140717641208208
>>> id(1)                             # Small integers are interned
140717641208208
>>> id(x[0]).to_bytes(8, 'little')    # Checks out as the first 8 bytes
b'\x90\x91\x04a\xfb\x7f\x00\x00'
>>> id(x[1]).to_bytes(8, 'little')    # Checks out as the last 8 bytes
b'\xc8[J\xaa+\x02\x00\x00'

As you can see, it is quite deterministic, but serializes information that is essentially useless to you. The operation is the same for numeric arrays as for object arrays: it returns a view or copy of the underlying buffer. The contents of the buffer is what is throwing you off.

Since you mentioned that you are computing hashes, keep in mind that there is a reason that python lists are unhashable. You can have lists that are equal at one time and different at another. Using IDs is generally not a good idea for an effective hash.

like image 181
Mad Physicist Avatar answered Nov 15 '22 11:11

Mad Physicist