I was using pandas and numpy to process some data, and I ended up with two similar-looking arrays:
array(['french', 'mexican', 'cajun_creole', ..., 'southern_us', 'italian',
'thai'], dtype='<U12')
array(['french', 'mexican', 'cajun_creole', ..., 'jamaican', 'italian',
'thai'], dtype=object)
I don't see the difference. What is <U12?
<U12 is a numpy dtype string:
<   little-endian byte order
U   Unicode
12  12 characters (each element holds at most 12)
(Source)
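As a small sketch of the breakdown above, the pieces of the dtype string can be read back from the dtype object itself. The variable names here are only illustrative; the 4 bytes per character follows from numpy storing fixed-width Unicode as UTF-32 code units.

```python
import numpy as np

dt = np.dtype('<U12')

kind = dt.kind              # 'U' marks a fixed-width Unicode string type
itemsize = dt.itemsize      # bytes reserved per element
n_chars = dt.itemsize // 4  # each character takes 4 bytes (UTF-32)

print(kind, itemsize, n_chars)
```

So every element of a <U12 array occupies 48 bytes, no matter how short the string stored in it is.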
The difference is in how the elements are stored. <U12 stores them flat, zero-padding each entry to length 12. To see this, we can use tobytes to dump the raw data buffer:
>>> au
array(['french', 'mexican', 'cajun_creole', 'Ellipsis', 'southern_us',
'italian', 'thai'], dtype='<U12')
>>>
>>> sz = au.dtype.itemsize
>>> [au.tobytes()[i:i+sz].decode('utf32') for i in range(0, au.size * sz, sz)]
['french\x00\x00\x00\x00\x00\x00', 'mexican\x00\x00\x00\x00\x00', 'cajun_creole', 'Ellipsis\x00\x00\x00\x00', 'southern_us\x00', 'italian\x00\x00\x00\x00\x00', 'thai\x00\x00\x00\x00\x00\x00\x00\x00']
object stores only object references, i.e. pointers to Python objects (here mostly str objects). We can verify this using the fact that, in the current CPython implementation, id returns a Python object's memory address:
>>> ao
array(['french', 'mexican', 'cajun_creole', Ellipsis, 'southern_us',
'italian', 'thai'], dtype=object)
>>>
>>> sz = ao.dtype.itemsize
>>> [int.from_bytes(ao.tobytes()[i:i+sz], 'little') for i in range(0, ao.size * sz, sz)]
[140626141129896, 140625895652128, 140625895628080, 8856512, 140625895627504, 140626141132200, 140626343518024]
>>> [id(it) for it in ao]
[140626141129896, 140625895652128, 140625895628080, 8856512, 140625895627504, 140626141132200, 140626343518024]
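For a practical feel of what the two storage schemes mean, here is a minimal sketch. The `labels` list and the variable names are made up for illustration. The object array in the question likely came from pandas, which has traditionally kept strings as Python objects, but that is an assumption about the asker's code.

```python
import numpy as np

labels = ['french', 'mexican', 'cajun_creole', 'thai']

au = np.array(labels)                # numpy infers a fixed-width Unicode dtype
ao = np.array(labels, dtype=object)  # references to ordinary str objects

# Same values, different storage.
same_values = au.tolist() == ao.tolist()

# Fixed-width storage silently truncates anything longer than 12 characters...
au[0] = 'southern_us_style'
# ...while an object array simply points at the new, longer str.
ao[0] = 'southern_us_style'

# Converting between the two representations.
back_to_unicode = ao.astype(str)  # width is taken from the longest element
as_objects = au.astype(object)

print(au.dtype, repr(au[0]))
print(ao.dtype, repr(ao[0]))
print(back_to_unicode.dtype, as_objects.dtype)
```

The truncation is the main thing to watch for: if longer strings may be assigned later, either keep the object dtype or create the array with a wider dtype up front.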