
When creating a Python bytearray from NumPy array, where does the extra data come from?

Consider two ways of naively making the same bytearray (using Python 2.7.11, but confirmed same behavior in 3.4.3 as well):

In [80]: from array import array

In [81]: import numpy as np    

In [82]: a1 = array('L',  [1, 3, 2, 5, 4])

In [83]: a2 = np.asarray([1,3,2,5,4], dtype=int)

In [84]: b1 = bytearray(a1)

In [85]: b2 = bytearray(a2)

Since both array.array and numpy.ndarray support the buffer protocol, I would expect both to export the same underlying data on conversion to bytearray.

But the data from above:

In [86]: b1
Out[86]: bytearray(b'\x01\x03\x02\x05\x04')

In [87]: b2
Out[87]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')

At first I thought maybe a naive call to bytearray on a NumPy array will inadvertently get some extra bytes due to data type, contiguity, or some other overhead data.

But even when looking at the NumPy buffer data handle directly, it still says size is 40 and gives the same data:

In [90]: a2.data
Out[90]: <read-write buffer for 0x7fb85d60fee0, size 40, offset 0 at 0x7fb85d668fb0>

In [91]: bytearray(a2.data)
Out[91]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')

The same thing happens with a2.view():

In [93]: bytearray(a2.view())
Out[93]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')

I noted that if I gave dtype=np.int32 then the length of bytearray(a2) is 20 instead of 40, suggesting that the extra bytes have to do with type information -- it's just not clear why or how:

In [20]: a2 = np.asarray([1,3,2,5,4], dtype=int)

In [21]: len(bytearray(a2.data))
Out[21]: 40

In [22]: a2 = np.asarray([1,3,2,5,4], dtype=np.int32)

In [23]: len(bytearray(a2.data))
Out[23]: 20

AFAICT, np.int32 ought to correspond to the array 'L' typecode, but any explanations about why not would be massively helpful.

How can one reliably extract only the part of the data that "should" be exported via the buffer protocol, that is, the same bytes that the plain array produces in this case?

asked Sep 03 '25 01:09 by ely


1 Answer

When you create your bytearray from the array.array, it is treating it as an iterable of ints, not as a buffer. You can see this because:

>>> bytearray(a1)
bytearray(b'\x01\x03\x02\x05\x04')
>>> bytearray(buffer(a1))
bytearray(b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00')

That is, creating a bytearray directly from the array gives you "plain" ints, but creating a bytearray from a buffer of the array gives you the actual byte representations of those ints. Also, you cannot create a bytearray from an array that has ints that won't fit into a single byte:
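For what it's worth, here is a minimal sketch of the two interpretations side by side. It assumes Python 3, where buffer() no longer exists, so it takes the raw memory with array.array.tobytes() and makes the iteration path explicit with list(). The variable names are only illustrative.

```python
from array import array

a1 = array('L', [1, 3, 2, 5, 4])

# Iteration path: each element becomes exactly one byte,
# so every value must lie in range(0, 256).
by_iteration = bytearray(list(a1))

# Raw-memory path: each element occupies a1.itemsize bytes,
# so the total length is len(a1) * a1.itemsize.
raw = bytearray(a1.tobytes())

print(by_iteration)
print(len(raw), a1.itemsize)
```

The item size of 'L' is platform-dependent, which would explain 8 bytes per element in the question's output and 4 bytes per element in mine. NumPy arrays have the analogous itemsize attribute and tobytes() method.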

>>> bytearray(array.array('L', [256]))
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    bytearray(array.array('L', [256]))
ValueError: byte must be in range(0, 256)

The behavior is still puzzling, though, because both array.array and np.ndarray support both the buffer protocol and iteration, yet creating a bytearray from an array.array gets the data via iteration, while creating a bytearray from a numpy.ndarray gets the data via the buffer protocol. There is presumably some arcane explanation for this switched priority in the C internals of these two types, but I have no idea what it is.

In any case, it's not really correct to say that what you're seeing with your a1 is what "should" happen; as I showed above, the data '\x01\x03\x02\x05\x04' is not actually what array.array exposes via the buffer protocol. If anything, the behavior with the numpy array is what you "should" get from the buffer protocol; it is the array.array behavior that is not consistent with the buffer protocol.
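If what you actually want is the compact one-byte-per-value form, one possible approach is to decode the raw bytes back into integers and repack them. The sketch below is a hedged illustration, assuming Python 3 and native byte order. It uses memoryview.cast() to reinterpret the bytes as unsigned longs, and the names raw, values, and compact are made up for the example.

```python
from array import array

a1 = array('L', [1, 3, 2, 5, 4])

# The raw bytes, i.e. what a buffer export of the array looks like.
raw = bytearray(a1.tobytes())

# Reinterpret the raw bytes as native unsigned longs to recover the values.
values = memoryview(raw).cast('L').tolist()

# Repack as one byte per value; this raises ValueError if any value
# is outside range(0, 256).
compact = bytearray(values)

print(values)
print(compact)
```

With a NumPy array, bytearray(a2.tolist()) should give the same compact form, under the same restriction that every value fits in a single byte.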

answered Sep 04 '25 14:09 by BrenBarn