I just wanted to confirm whether the default data type for strings is Unicode when creating an ndarray. I could not find any reference which states this clearly. Maybe it is too obvious and doesn't need stating.
When dtype is specified:
>>> import numpy as np
>>> g = np.array([['a', 'b'],['c', 'd']], dtype='S')
>>> g
array([[b'a', b'b'],
       [b'c', b'd']],
      dtype='|S1')
Without specifying the dtype:
>>> g = np.array([['a', 'b'],['c', 'd']])
>>> g
array([['a', 'b'],
       ['c', 'd']],
      dtype='<U1')
Also, what does the literal b indicate when dtype is specified? As per the documentation, it indicates bool, which doesn't seem to be the case here.
Can someone please clarify?
A data type object (an instance of the numpy.dtype class) describes how the bytes in the fixed-size block of memory corresponding to an array item should be interpreted. It describes, among other aspects of the data, the type of the data (integer, float, Python object, etc.) and the size of the data (how many bytes are in, e.g., the integer).
>>> np.linspace(0, 10, num=5)
array([ 0. ,  2.5,  5. ,  7.5, 10. ])
While the default data type is floating point (np.float64), you can explicitly specify which data type you want using the dtype keyword.
The dtype of a NumPy array containing string values is sized to the longest string present in the array when it is created. Once set, the array can only store strings no longer than that length; longer strings assigned later are silently truncated.
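A minimal sketch of the fixed-width behaviour described above, assuming NumPy on Python 3. The array contents and variable names are made up for illustration.

```python
import numpy as np

# The width is inferred from the longest string at creation time.
arr = np.array(['a', 'bcd'])
width_dtype = arr.dtype          # 3-character Unicode dtype

# Assigning a longer string silently truncates it to the fixed width.
arr[0] = 'wxyz1234'
truncated = arr[0]

# Converting to a wider dtype makes room for longer values.
wide = arr.astype('U8')
wide[0] = 'wxyz1234'

print(width_dtype, truncated, wide[0])
```

If strings of unknown length need to be stored later, choosing a wider dtype up front (or dtype=object) avoids the truncation.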
b'...' means it's a byte string, and the default dtype for arrays of strings depends on the kind of strings. Unicode strings (Python 3 strings are Unicode) get the dtype U, and Python 2 str or Python 3 bytes get the dtype S. You can find the explanation of dtypes in the NumPy documentation:
Array-protocol type strings
The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:
- '?' boolean
- 'b' (signed) byte
- 'B' unsigned byte
- 'i' (signed) integer
- 'u' unsigned integer
- 'f' floating-point
- 'c' complex-floating point
- 'm' timedelta
- 'M' datetime
- 'O' (Python) objects
- 'S', 'a' zero-terminated bytes (not recommended)
- 'U' Unicode string
- 'V' raw data (void)
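As a rough illustration of the type strings listed above, the sketch below builds a few dtypes from such strings and inspects them. It assumes a current NumPy, where Unicode is stored with 4 bytes per character, so the number after U counts characters while itemsize reports bytes.

```python
import numpy as np

u = np.dtype('U4')   # Unicode string, 4 characters
s = np.dtype('S4')   # byte string, 4 bytes
i = np.dtype('i4')   # signed integer, 4 bytes

# .kind is a one-character code; .itemsize is always in bytes.
kinds = (u.kind, s.kind, i.kind)
sizes = (u.itemsize, s.itemsize, i.itemsize)

# In type strings, '?' is boolean and 'b' is a signed byte.
bool_dtype = np.dtype('?')
byte_dtype = np.dtype('b')

print(kinds, sizes, bool_dtype, byte_dtype)
```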
However, in your first case you actually forced NumPy to convert it to bytes because you specified dtype='S'.
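The difference can be checked directly. This is a small sketch assuming Python 3 and ASCII-only input; the variable names are illustrative.

```python
import numpy as np

# Python 3 str input: NumPy picks a Unicode ('U') dtype by default.
default = np.array([['a', 'b'], ['c', 'd']])

# dtype='S' forces a conversion of the same input to byte strings.
forced = np.array([['a', 'b'], ['c', 'd']], dtype='S')

# bytes input gets an 'S' dtype without being asked.
from_bytes = np.array([b'a', b'b'])

# Elements of an 'S' array are bytes objects, which is why Python
# displays them with the b'...' prefix; decode to get a str back.
elem = forced[0, 0]
decoded = elem.decode('ascii')

print(default.dtype, forced.dtype, from_bytes.dtype, decoded)
```

So the b prefix is Python's notation for a bytes literal and has nothing to do with a boolean dtype.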