How does np.ndarray.tobytes() work for dtype "object"?

Tags: numpy

I encountered a strange behavior of np.ndarray.tobytes() that makes me doubt that it is working deterministically, at least for arrays of dtype=object.

import numpy as np
print(np.array([1,[2]]).dtype)
# => object
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00H{!-\x01\x00\x00\x00'
print(np.array([1,[2]]).tobytes())
# => b'0h\xa3\t\x01\x00\x00\x00\x88\x9d)-\x01\x00\x00\x00'

In the sample code, a list of mixed python objects ([1, [2]]) is first converted to a numpy array, and then transformed to a byte sequence using tobytes().

Why do the resulting byte representations differ for repeated instantiations of the same data? The documentation just states that it converts an ndarray to raw Python bytes, and it mentions no limitations. So far, I have observed this only for dtype=object. Numeric arrays always yield the same byte sequence:

np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh@l\xee?Qg\x1e\x8f~l\xe7?'
np.random.seed(42); print(np.random.rand(3).tobytes())
# b'\xecQ_\x1ew\xf8\xd7?T\xd6\xbbh@l\xee?Qg\x1e\x8f~l\xe7?'

Have I missed something elementary about Python's/NumPy's memory architecture? I tested with numpy version 1.17.2 on a Mac.

Context: I encountered this problem when trying to compute a hash for arbitrary data structures. I had hoped that I could rely on the basic serialization capabilities of tobytes(), but this appears to be a wrong premise. I know that pickle is the standard for serialization in Python, but since I don't require portability and my data structures only contain numbers, I first sought help with numpy.

541

asked Mar 03 '20 15:03

normanius

1 Answer

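An array of dtype object stores pointers to the objects it contains, not the objects' values. In CPython, such a pointer is exactly what id() returns for the object. Each newly created list is allocated wherever memory happens to be free, so its address generally differs from one instantiation to the next. Small integers, however, are cached by CPython (-5 through 256), so 1 refers to the same integer object every time.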

You can see exactly how this works by checking the IDs of some sample objects:

>>> x = np.array([1, [2]], dtype=object)   # explicit dtype; newer NumPy rejects ragged input without it
>>> x.tobytes()
b'\x90\x91\x04a\xfb\x7f\x00\x00\xc8[J\xaa+\x02\x00\x00'
>>> id(x[0])
140717641208208
>>> id(1)                             # Small integers are interned
140717641208208
>>> id(x[0]).to_bytes(8, 'little')    # Checks out as the first 8 bytes
b'\x90\x91\x04a\xfb\x7f\x00\x00'
>>> id(x[1]).to_bytes(8, 'little')    # Checks out as the last 8 bytes
b'\xc8[J\xaa+\x02\x00\x00'

As you can see, it is quite deterministic, but it serializes information that is essentially useless to you. The operation is the same for numeric arrays as for object arrays: it returns a copy of the underlying buffer as bytes. The contents of the buffer are what is throwing you off: for a numeric array the buffer holds the values themselves, while for an object array it holds only the addresses of the elements.
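The same check can be written as a small script. This is only a sketch: it assumes CPython (where id() is the object's address) and passes dtype=object explicitly, which is my addition so that the ragged input is accepted by recent NumPy versions.

```python
import sys
import numpy as np

# dtype=object is given explicitly (an assumption of this sketch);
# recent NumPy versions refuse ragged input without it.
x = np.array([1, [2]], dtype=object)

# Each slot of an object array holds one pointer, so itemsize is the pointer width.
pointer_size = x.dtype.itemsize

# In CPython, id() returns the memory address of the object.
rebuilt = b"".join(
    id(item).to_bytes(pointer_size, sys.byteorder) for item in x
)

# The bytes from tobytes() are just the concatenated element addresses.
same_as_tobytes = rebuilt == x.tobytes()
print(same_as_tobytes)
```

If this prints True, the output of tobytes() carries no information about the values 1 and [2], only about where those objects currently live.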

Since you mentioned that you are computing hashes, keep in mind that there is a reason that Python lists are unhashable: they are mutable, so two lists can be equal at one time and different at another. Using IDs is generally not a good basis for a hash either, because an ID reflects where an object lives, not what it contains.
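If the goal is a hash that depends on the values, one option is to walk the structure and feed the numbers themselves to a hash function. The following is a minimal sketch, not a recommendation of a specific scheme: content_digest is a hypothetical helper, it assumes the data consists only of nested lists/tuples of real numbers, and it converts every leaf to float64 (so 1 and 1.0 hash identically, and very large integers lose precision).

```python
import hashlib
import numpy as np

def content_digest(data):
    """Hypothetical helper: hash nested lists/tuples of numbers by value."""
    h = hashlib.sha256()

    def feed(node):
        if isinstance(node, (list, tuple)):
            # Bracket markers keep the nesting structure in the hash.
            h.update(b"[")
            for child in node:
                feed(child)
            h.update(b"]")
        else:
            # Tag byte plus the 8 raw bytes of the value as float64
            # (assumption: every leaf is a real number).
            h.update(b"n" + np.float64(node).tobytes())

    feed(data)
    return h.hexdigest()

first = content_digest([1, [2]])
second = content_digest([1, [2]])
print(first == second)
```

Unlike tobytes() on an object array, the digest here is the same for repeated instantiations of equal data and changes when a value or the nesting changes. pickle or any other value-based serialization could fill the same role.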

181

answered Nov 15 '22 11:11

Mad Physicist
