
Numpy dtype - data type not understood

I have a dataframe and I am looking at the data types associated with each column.

When I run:

In [23]: df.dtype.descr

Out [24]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|O')]

I want to set the currency dtype to S7. I am doing:

In [25]: dtype_new[-1] = (u'currency', "|S7")
In [26]: print dtype_new
Out [27]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|S7')]

It looks to be the correct format. So I try to put it back to my df:

In [28]: df = df.astype(np.dtype(dtype_new))

And I get the error:

TypeError('data type not understood',)

What should I change? This was working before I recently updated Anaconda, and I am not aware of what the issue is. Thanks.

ADJUSTMENT:

df.dtype is

In [23]: records.dtype
Out[23]: dtype((numpy.record, [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', 'O')]))

How can I change the last dtype from 'O' to something else? Specifically, to a string of at most 7 characters.

Lastly, is this a unicode issue? With unicode field names:

In [38]: np.dtype([(u'date', '<i8')])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-38-8702f0c7681f> in <module>()
----> 1 np.dtype([(u'date', '<i8')])

TypeError: data type not understood

Without unicode:

In [39]: np.dtype([('date', '<i8')])
Out[39]: dtype([('date', '<i8')])
user1911092 asked Sep 20 '17 18:09



1 Answer

You seem to have put your finger on it with the unicode question and, in fact, you have touched on a sore point.

Let's start from the latest NumPy documentation.

The documentation on dtypes states that:

[(field_name, field_dtype, field_shape), ...]

obj should be a list of fields where each field is described by a tuple of length 2 or 3. (Equivalent to the descr item in the __array_interface__ attribute.)

The first element, field_name, is the field name (if this is '' then a standard field name, 'f#', is assigned). The field name may also be a 2-tuple of strings where the first string is either a “title” (which may be any string or unicode string) or meta-data for the field which can be any object, and the second string is the “name” which must be a valid Python identifier. The second element, field_dtype, can be anything that can be interpreted as a data-type. The optional third element field_shape contains the shape if this field represents an array of the data-type in the second element. Note that a 3-tuple with a third argument equal to 1 is equivalent to a 2-tuple. This style does not accept align in the dtype constructor as it is assumed that all of the memory is accounted for by the array interface description.

So the documentation does not really specify whether the field name can be unicode. What we can be sure of from the documentation is that if we define a tuple as the field name, e.g. ((u'date', 'date'), '<i8'), then using unicode for the "title" (note: still not for the name!) leads to no errors.
If you also use unicode for the name, i.e. ((u'date', u'date'), '<i8'), you will still get an error.

Now, you can use unicode names in Py2 by calling encode("ascii") on them:

(u'date'.encode("ascii"))  

and this should work.
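As a rough sketch of that workaround applied to a whole descr list: the helper below, `with_native_names`, is a hypothetical name and not part of NumPy. It assumes ASCII-only field names and converts each name to the interpreter's native `str` type before the list is handed to `np.dtype`. On Py3, u'date' is already a native `str`, so the helper leaves it untouched there.

```python
import numpy as np

def with_native_names(descr):
    """Return a copy of a dtype descr list with native-str field names.

    Hypothetical helper for illustration; assumes ASCII-only names.
    """
    fixed = []
    for field in descr:
        name = field[0]
        if isinstance(name, bytes) and str is not bytes:
            # Py3: bytes name -> str
            name = name.decode("ascii")
        elif not isinstance(name, (str, tuple)):
            # Py2: unicode name -> str; (title, name) tuples are left alone
            name = name.encode("ascii")
        fixed.append((name,) + tuple(field[1:]))
    return fixed

# A shortened version of the descr list from the question.
descr = [(u'date', '<i8'), (u'open', '<f8'), (u'currency', '|O')]
descr[-1] = (u'currency', '|S7')  # swap the object field for a 7-byte string

dt = np.dtype(with_native_names(descr))
print(dt.names)
```

The resulting `dt` can then be passed wherever a structured dtype is expected.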
One important point: in Py2, NumPy does not allow a dtype to be specified with unicode field names as a list of tuples, but it does allow them when the dtype is specified as a dictionary.

So in Py2 you can change the last field from |O to |S7 either by not using unicode names at all, or by calling encode("ascii") on any name that is defined as a unicode string.
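The dictionary form of the dtype specification mentioned above can be sketched as follows. This is a minimal example under assumptions: the field list is shortened from the question, and the array `records` is made-up placeholder data rather than the asker's dataframe.

```python
import numpy as np

# Field names given as unicode literals, passed through the
# {'names': ..., 'formats': ...} form instead of a list of tuples.
names = [u'date', u'open', u'currency']
formats = ['<i8', '<f8', 'S7']

dt = np.dtype({'names': names, 'formats': formats})

# Placeholder structured array with the new dtype.
records = np.zeros(2, dtype=dt)
records['currency'] = b'USD'
print(records.dtype)
```

The 'S7' format gives the currency field a fixed width of 7 bytes, so longer values would be truncated on assignment.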


And the bugs involved...

To understand why you see this behaviour, it is useful to look at the bugs/issues reported in NumPy and pandas and the related discussions.

Numpy
https://github.com/numpy/numpy/issues/2407
In the discussion (which I do not reproduce here) a few things stand out:

  • the "issue" has been going on for a while
  • one trick people used was to use encode("ascii") on the unicode string
  • remember that a plain string literal such as 'whatever' has a different default type in Py2 (bytes) and Py3 (unicode)
  • @hpaulj summed it up well in that issue report: "If the dtype specification is of the list of tuples type, it checks whether each name is a string (as defined by py2 or 3) But if the dtype specification is a dictionary {'names':[ alist], 'formats':[alist]...}, the py2 case also allows unicode names"

Pandas
On the pandas side, an issue related to the NumPy one has also been reported: https://github.com/pandas-dev/pandas/pull/13462
It seems to have been fixed not that long ago.

fedepad answered Oct 05 '22 11:10