
What is the default dtype for str like input in numpy?

I just wanted to confirm whether the default data type for string input is Unicode when creating an ndarray. I could not find any reference that states this clearly. Maybe it is too obvious to need stating.

When dtype is specified:

>>> import numpy as np
>>> g = np.array([['a', 'b'],['c', 'd']], dtype='S')
>>> g
array([[b'a', b'b'],
       [b'c', b'd']], 
      dtype='|S1')

Without specifying the dtype:

>>> g = np.array([['a', 'b'],['c', 'd']])
>>> g
array([['a', 'b'],
       ['c', 'd']], 
      dtype='<U1')

Also, what does the literal b indicate when the dtype is specified? According to the documentation, it indicates bool, which doesn't seem to be the case here.

Can someone please clarify?

Isha Garg asked Sep 05 '17 09:09



1 Answer

b'...' means it's a byte string; the b prefix is Python's bytes-literal notation, not a dtype code. The default dtype for an array of strings depends on the kind of string: Unicode strings (Python 3 str is Unicode) get the dtype U, while Python 2 str and Python 3 bytes get the dtype S. You can find the explanation of the dtype codes in the NumPy documentation:

Array-protocol type strings

The first character specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised. The supported kinds are:

  • '?' boolean
  • 'b' (signed) byte
  • 'B' unsigned byte
  • 'i' (signed) integer
  • 'u' unsigned integer
  • 'f' floating-point
  • 'c' complex-floating point
  • 'm' timedelta
  • 'M' datetime
  • 'O' (Python) objects
  • 'S', 'a' zero-terminated bytes (not recommended)
  • 'U' Unicode string
  • 'V' raw data (void)
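As a quick check of the codes listed above, here is a minimal sketch (assuming Python 3 and a reasonably recent NumPy; the variable names are only illustrative) showing which kind NumPy infers for str and bytes input, and that the type code 'b' is a signed byte rather than a boolean:

```python
import numpy as np

# Python 3 str input: NumPy infers a Unicode dtype (kind 'U').
u = np.array(['a', 'bc'])
# Python 3 bytes input: NumPy infers a byte-string dtype (kind 'S').
s = np.array([b'a', b'bc'])

# The number after the kind is the length of the longest element;
# Unicode items take 4 bytes per character, byte strings 1 byte.
print(u.dtype.kind, u.dtype.itemsize)  # U 8
print(s.dtype.kind, s.dtype.itemsize)  # S 2

# As type codes, 'b' is a signed byte (int8) and '?' is the boolean.
print(np.dtype('b') == np.int8)    # True
print(np.dtype('?') == np.bool_)   # True
```

So the b shown in front of each element in the question's output is unrelated to the 'b' type code.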

However, in your first case you actually forced NumPy to convert the strings to bytes because you specified dtype='S'.
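To illustrate that conversion, here is a small sketch (again assuming Python 3; it only works this simply because the strings are plain ASCII) that forces the byte-string dtype and then converts back to Unicode with astype:

```python
import numpy as np

# Forcing dtype='S' makes NumPy store the str input as bytes.
g = np.array([['a', 'b'], ['c', 'd']], dtype='S')
print(g.dtype)                      # |S1
print(isinstance(g[0, 0], bytes))   # True: elements are byte strings

# Casting to 'U' decodes the ASCII bytes back into Unicode strings.
back = g.astype('U')
print(back.dtype.kind)              # U
print(back[0, 0] == 'a')            # True
```

Leaving out the dtype argument, as in your second example, avoids the round trip entirely.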

MSeifert answered Nov 14 '22 22:11