Consider two ways of naively making the same bytearray
(using Python 2.7.11, but confirmed same behavior in 3.4.3 as well):
In [80]: from array import array
In [81]: import numpy as np
In [82]: a1 = array('L', [1, 3, 2, 5, 4])
In [83]: a2 = np.asarray([1,3,2,5,4], dtype=int)
In [84]: b1 = bytearray(a1)
In [85]: b2 = bytearray(a2)
Since both array.array and numpy.ndarray support the buffer protocol, I would expect both to export the same underlying data on conversion to bytearray.
But the data from above:
In [86]: b1
Out[86]: bytearray(b'\x01\x03\x02\x05\x04')
In [87]: b2
Out[87]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')
At first I thought maybe a naive call to bytearray on a NumPy array would inadvertently pick up some extra bytes due to data type, contiguity, or some other overhead data.
But even when looking at the NumPy buffer data handle directly, it still reports a size of 40 and gives the same data:
In [90]: a2.data
Out[90]: <read-write buffer for 0x7fb85d60fee0, size 40, offset 0 at 0x7fb85d668fb0>
In [91]: bytearray(a2.data)
Out[91]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')
The same thing happens with a2.view():
In [93]: bytearray(a2.view())
Out[93]: bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00\x03\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00\x05\x00\x00\x00\x00\x00\x00\x00\x04\x00\x00\x00\x00\x00\x00\x00')
I noted that if I gave dtype=np.int32 then the length of bytearray(a2) is 20 instead of 40, suggesting that the extra bytes have to do with type information; it's just not clear why or how:
In [20]: a2 = np.asarray([1,3,2,5,4], dtype=int)
In [21]: len(bytearray(a2.data))
Out[21]: 40
In [22]: a2 = np.asarray([1,3,2,5,4], dtype=np.int32)
In [23]: len(bytearray(a2.data))
Out[23]: 20
AFAICT, np.int32 ought to correspond to the array 'L' typecode, but any explanations about why not would be massively helpful.
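One way to check which element width a given typecode actually has is to compare itemsize against the length of the raw bytes. The sketch below uses only the standard library; the helper name raw_length_report is made up for illustration, and the widths of 'I' and 'L' are platform-dependent (the array documentation only promises a minimum size), so they are printed rather than assumed.

```python
from array import array


def raw_length_report(values, typecodes=('B', 'I', 'L', 'Q')):
    """Map each typecode to (itemsize, length of the raw byte representation).

    Hypothetical helper for illustration; not part of the array module.
    """
    report = {}
    for code in typecodes:
        a = array(code, values)
        # tobytes() returns the in-memory representation of all elements.
        report[code] = (a.itemsize, len(a.tobytes()))
    return report


report = raw_length_report([1, 3, 2, 5, 4])
for code, (itemsize, nbytes) in sorted(report.items()):
    # The raw length is always element count * itemsize.
    print(code, itemsize, nbytes)
```

If 'L' reports an itemsize of 8 on a given machine, then it matches a 64-bit integer dtype there rather than np.int32, which is always 4 bytes wide.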
How can one reliably extract only the part of the data that "should" be exported via the buffer protocol, that is, the same as what the plain array data looks like in this case?
When you create your bytearray from the array.array, it is treated as an iterable of ints, not as a buffer. You can see this because:
>>> bytearray(a1)
bytearray(b'\x01\x03\x02\x05\x04')
>>> bytearray(buffer(a1))
bytearray(b'\x01\x00\x00\x00\x03\x00\x00\x00\x02\x00\x00\x00\x05\x00\x00\x00\x04\x00\x00\x00')
That is, creating a bytearray directly from the array gives you "plain" ints, but creating a bytearray from a buffer of the array gives you the actual byte representations of those ints. Also, you cannot create a bytearray from an array that has ints that won't fit into a single byte:
>>> bytearray(array.array(b'L', [256]))
Traceback (most recent call last):
  File "<pyshell#38>", line 1, in <module>
    bytearray(array.array(b'L', [256]))
ValueError: byte must be in range(0, 256)
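To make the two paths explicit without relying on which one bytearray() picks by default, the following sketch forces each of them in turn. It assumes Python 3, where the buffer builtin used above no longer exists and memoryview takes its place; the variable names are illustrative only.

```python
from array import array

a1 = array('L', [1, 3, 2, 5, 4])

# Iteration path: every element must fit in one byte and becomes one byte.
by_iteration = bytearray(list(a1))

# Buffer path: the raw in-memory representation of every element.
by_buffer = bytearray(memoryview(a1))

print(by_iteration)    # one byte per element
print(len(by_buffer))  # element count * a1.itemsize (platform-dependent)

# The iteration path rejects values that do not fit into a single byte.
try:
    bytearray(list(array('L', [256])))
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```

Here list(a1) guarantees the iterable-of-ints interpretation, and memoryview(a1) guarantees the buffer interpretation, so the result does not depend on the Python version's choice between them.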
The behavior is still puzzling, though, because both array.array and np.ndarray support both the buffer protocol and iteration, yet somehow creating a bytearray from an array.array gets the data via iteration, while creating a bytearray from a numpy.ndarray gets the data via the buffer protocol. There is presumably some arcane explanation for this switched priority in the C internals of these two types, but I have no idea what it is.
In any case, it's not really correct to say that what you're seeing with your a1 is what "should" happen; as I showed above, the data '\x01\x03\x02\x05\x04' is not actually what array.array exposes via the buffer protocol. If anything, the behavior with the numpy array is what you "should" get from the buffer protocol; it is the array.array behavior that is not consistent with the buffer protocol.
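If the goal is the compact one-byte-per-element form regardless of how the data was exported, one option is to decode the raw buffer back into ints and build the bytearray from those. This is a minimal sketch using only the standard library; compact_bytes is a hypothetical helper, and the defaults assume little-endian 64-bit signed elements, which is what the 40-byte output in the question looks like.

```python
import struct


def compact_bytes(raw, fmt_char='q', byteorder='<'):
    """Decode a raw element buffer and re-encode it as one byte per element.

    Hypothetical helper: fmt_char and byteorder are struct format characters
    and must match how the buffer was produced.
    """
    itemsize = struct.calcsize(byteorder + fmt_char)
    count, remainder = divmod(len(raw), itemsize)
    if remainder:
        raise ValueError("buffer length is not a multiple of the item size")
    values = struct.unpack('%s%d%s' % (byteorder, count, fmt_char), raw)
    # bytearray() raises ValueError if any value is outside range(0, 256).
    return bytearray(values)


# Stand-in for the 40-byte buffer exported by the int64 NumPy array.
raw = struct.pack('<5q', 1, 3, 2, 5, 4)
print(compact_bytes(raw))
```

With NumPy at hand, bytearray(a2.tolist()) should give the same result without any manual decoding, since tolist() returns plain Python ints; either way, this only works while every value fits into a single byte.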