
With a NumPy Unicode array, why does dtype '|U5' become dtype '<U5'?

I learned from a comment on an answer to "Python: Numpy Data IO, how to save data by different dtype for each column?" that < indicates the byte order (little-endian), U means Unicode, and 5 is the number of characters.

So what does '|' mean in '|U5', and why did '|' change to '<' in the example below? The example is from the official NumPy documentation: https://numpy.org/devdocs/user/basics.io.genfromtxt.html

import numpy as np
from io import StringIO

data = u"1, abc , 2\n 3, xxx, 4"
# Without autostrip
np.genfromtxt(StringIO(data), delimiter=",", dtype="|U5")

array([['1', ' abc ', ' 2'],
       ['3', ' xxx', ' 4']], dtype='<U5')
user67275 asked Nov 17 '25 12:11


1 Answer

The U data type stores each Unicode character as a 32-bit integer (i.e. 4 bytes). An integer with more than one byte must have an endianness, so the data type will show either < (little-endian) or > (big-endian). Arrays of type S store each character in a single byte, so endianness is irrelevant, and the byte-order character will be |.
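A quick way to check this is to compare the item sizes and the byte-order character of the two types. This is only a small sketch; the variable names u and s are mine, and the byte-order character shown for U depends on the machine it runs on.

```python
import numpy as np

u = np.dtype("U5")  # 5 Unicode characters, 4 bytes each
s = np.dtype("S5")  # 5 single-byte characters

print(u.itemsize)   # 20
print(s.itemsize)   # 5
print(s.str)        # |S5  (byte order not applicable)
print(u.str)        # starts with < or >, depending on the platform
```

The | only survives for S because a one-byte element has no byte order to report.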

For example, here a1 and a2 each contain a single Unicode character. The arrays are created with opposite endianness.

In [248]: a1 = np.array(['π'], dtype='<U1')

In [249]: a2 = np.array(['π'], dtype='>U1')

In [250]: a1
Out[250]: array(['π'], dtype='<U1')

In [251]: a2
Out[251]: array(['π'], dtype='>U1')

Inspect the actual bytes of the data in each array; you can see the different orders for each type:

In [252]: a1.view(np.uint8)
Out[252]: array([192,   3,   0,   0], dtype=uint8)

In [253]: a2.view(np.uint8)
Out[253]: array([  0,   0,   3, 192], dtype=uint8)
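The same thing can be cross-checked against Python's own UTF-32 codecs, assuming (as described above) that each character is stored as one 4-byte code point. This is just a sketch that recreates a1 and a2 so that it runs on its own.

```python
import numpy as np

a1 = np.array(["π"], dtype="<U1")  # little-endian storage
a2 = np.array(["π"], dtype=">U1")  # big-endian storage

# The raw buffers match the corresponding UTF-32 encodings of the character.
print(a1.tobytes() == "π".encode("utf-32-le"))  # True
print(a2.tobytes() == "π".encode("utf-32-be"))  # True

# The stored bytes differ, but the values read back are the same string.
print(a1.tobytes() == a2.tobytes())             # False
print(a1.tolist() == a2.tolist())               # True
```

So the byte-order character only describes the layout in memory, not the text the array holds.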

When you specify | with a Unicode type, NumPy apparently ignores it and uses the native byte order (little-endian, shown as <, on this machine), e.g.

In [254]: np.dtype("|U5")
Out[254]: dtype('<U5')

One might as well not include it at all:

In [255]: np.dtype("U5")
Out[255]: dtype('<U5')
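To confirm that these spellings all end up as the native byte order rather than specifically little-endian, you can compare against sys.byteorder. This is a sketch with names of my own choosing (native, results); I have added "=U5", the explicit native-order spelling, for comparison.

```python
import sys
import numpy as np

# Byte-order character NumPy should report for this machine.
native = "<" if sys.byteorder == "little" else ">"

# Build a dtype from each spelling and show how NumPy normalizes it.
results = {spec: np.dtype(spec) for spec in ("|U5", "U5", "=U5")}
for spec, dt in results.items():
    print(spec, "->", dt.str, "same as U5:", dt == np.dtype("U5"))
```

On a big-endian machine I would expect the same code to report >U5 for all three.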
Warren Weckesser answered Nov 20 '25 04:11