I learned from a comment on this answer, "Python: Numpy Data IO, how to save data by different dtype for each column?", that < specifies the byte order (little-endian), U means Unicode, and 5 is the number of characters.
So what does '|' mean in '|U5', and why did '|' change to '<' in the example below? The example is from the official NumPy documentation: https://numpy.org/devdocs/user/basics.io.genfromtxt.html
>>> import numpy as np
>>> from io import StringIO
>>> data = u"1, abc , 2\n 3, xxx, 4"
>>> # Without autostrip
>>> np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")
array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')
The U data type stores each Unicode character as a 32-bit integer (i.e. 4 bytes). An integer that spans more than one byte must have an endianness, so the data type will show either < (little-endian) or > (big-endian). Arrays of type S store each character in a single byte, so endianness is irrelevant, and the byte-order character will be |.
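As a quick check of the sizes described above, here is a minimal sketch that compares a U5 and an S5 dtype. The variable names u and s are just for illustration, and the printed byte-order prefix of u.str depends on the machine.

```python
import numpy as np

u = np.dtype("U5")  # five Unicode characters, 4 bytes each
s = np.dtype("S5")  # five single-byte characters

print(u.itemsize)   # 20 (5 characters * 4 bytes)
print(s.itemsize)   # 5  (5 characters * 1 byte)

# dtype.byteorder reports '|' when byte order is not applicable
# and '=' when the dtype uses the machine's native order.
print(s.byteorder)  # |
print(u.byteorder)  # =

# dtype.str spells the native order out explicitly,
# e.g. '<U5' on a little-endian machine.
print(s.str)        # |S5
print(u.str)
```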
For example, here a1 and a2 each contain a single Unicode character. The arrays are created with opposite endianness.
In [248]: a1 = np.array(['π'], dtype='<U1')
In [249]: a2 = np.array(['π'], dtype='>U1')
In [250]: a1
Out[250]: array(['π'], dtype='<U1')
In [251]: a2
Out[251]: array(['π'], dtype='>U1')
Inspect the actual bytes of the data in each array; you can see the different orders for each type:
In [252]: a1.view(np.uint8)
Out[252]: array([192, 3, 0, 0], dtype=uint8)
In [253]: a2.view(np.uint8)
Out[253]: array([ 0, 0, 3, 192], dtype=uint8)
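The byte patterns above match the UTF-32 encodings of the character (π is U+03C0, and 0xC0 is 192). The following sketch, which assumes nothing beyond NumPy and Python's built-in codecs, compares the raw buffers with the little-endian and big-endian UTF-32 encodings and confirms that both arrays still hold the same string.

```python
import numpy as np

a1 = np.array(["π"], dtype="<U1")  # little-endian storage
a2 = np.array(["π"], dtype=">U1")  # big-endian storage

print(hex(ord("π")))        # 0x3c0

# Raw memory of each array, compared with Python's UTF-32 codecs.
print(a1.tobytes().hex())   # c0030000
print(a2.tobytes().hex())   # 000003c0
print(a1.tobytes() == "π".encode("utf-32-le"))  # True
print(a2.tobytes() == "π".encode("utf-32-be"))  # True

# The stored value is the same string either way;
# only the layout in memory differs.
print(a1.tolist() == a2.tolist())  # True
```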
When you specify | with a Unicode type, NumPy apparently ignores it and uses the native byte order (shown as < on a little-endian machine), e.g.
In [254]: np.dtype("|U5")
Out[254]: dtype('<U5')
One might as well not include it at all:
In [255]: np.dtype("U5")
Out[255]: dtype('<U5')
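To see this without hard-coding the platform, here is a small sketch that derives the native prefix from sys.byteorder and checks that "|U5", "=U5" and "U5" all resolve to the same dtype. The names native and specs are illustrative only.

```python
import sys
import numpy as np

# '<' on little-endian machines, '>' on big-endian ones.
native = "<" if sys.byteorder == "little" else ">"

specs = ["|U5", "=U5", "U5"]
for spec in specs:
    dt = np.dtype(spec)
    # Each spec is reported with the native byte-order prefix.
    print(spec, "->", dt.str, dt.str == native + "U5")

# All three spellings compare equal as dtypes.
print(np.dtype("|U5") == np.dtype("U5"))  # True
```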