I have a dataframe and I am looking at the data types associated with each column.
When I run:
In [23]: df.dtype.descr
Out [24]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|O')]
I want to set the currency dtype to S7. I am doing:
In [25]: dtype_new[-1] = (u'currency', "|S7")
In [26]: print dtype_new
Out [27]: [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', '|S7')]
It looks to be the correct format. So I try to put it back to my df:
In [28]: df = df.astype(np.dtype(dtype_new))
And I get the error:
TypeError('data type not understood',)
What should I change? This was working before I recently updated Anaconda, and I am not aware of what changed. Thanks.
ADJUSTMENT:
df.dtype is
In [23]: records.dtype
Out[23]: dtype((numpy.record, [(u'date', '<i8'), (u'open', '<f8'), (u'high', '<f8'), (u'low', '<f8'), (u'close', '<f8'), (u'volume', '<f8'), (u'dividend', '<f8'), (u'adj_factor', '<f8'), (u'split_factor', '<f8'), (u'liq', '<f8'), (u'currency', 'O')]))
How can I change the last dtype from 'O' to something else? Specifically, a string of at most 7 characters (S7).
LASTLY - is this a unicode issue? With Unicode:
In [38]: np.dtype([(u'date', '<i8')])
...:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-38-8702f0c7681f> in <module>()
----> 1 np.dtype([(u'date', '<i8')])
TypeError: data type not understood
No Unicode:
In [39]: np.dtype([('date', '<i8')])
Out[39]: dtype([('date', '<i8')])
'O' means (Python) objects. The first character of the type string specifies the kind of data and the remaining characters specify the number of bytes per item, except for Unicode, where it is interpreted as the number of characters. The item size must correspond to an existing type, or an error will be raised.
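As a rough illustration of that naming scheme, the kind character and the item size can be read back from a dtype object. The variable names are made up for this sketch.

```python
import numpy as np

int_dt = np.dtype('<i8')    # kind 'i', 8 bytes per item
bytes_dt = np.dtype('S7')   # kind 'S', 7 bytes per item
text_dt = np.dtype('U7')    # kind 'U', 7 characters per item
obj_dt = np.dtype('O')      # kind 'O', references to Python objects

for dt in (int_dt, bytes_dt, text_dt, obj_dt):
    # dt.str is the type string, dt.itemsize is always in bytes
    print(dt.str, dt.kind, dt.itemsize)
```

Note that for 'U7' the item size in bytes is larger than 7, because the 7 counts characters rather than bytes.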
To change the dtype of a given array, use the astype() method of the NumPy array. It takes the target data type as its argument and returns a converted copy. It supports all the generic and built-in data types.
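A minimal sketch of astype() on plain (non-structured) arrays follows. The sample values are invented for illustration, and it is assumed that the strings are pure ASCII.

```python
import numpy as np

ints = np.array([1, 2, 3])
floats = ints.astype(np.float64)   # returns a converted copy; ints is unchanged

codes = np.array(['USD', 'EUR'], dtype=object)
fixed = codes.astype('S7')         # object -> fixed-width 7-byte strings

print(floats.dtype, fixed.dtype)
```

Values longer than 7 bytes would be truncated by the 'S7' conversion, so the width should be chosen with the data in mind.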
dtype objects also contain information about the type, such as its bit-width and its byte order. The data type can also be used indirectly to query properties of the type, such as whether it is an integer.
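For instance, a small sketch of querying such properties, using the '<i8' type from the question:

```python
import numpy as np

d = np.dtype('<i8')

bit_width = d.itemsize * 8                      # itemsize is in bytes
is_integer = np.issubdtype(d, np.integer)       # True for any integer type
is_floating = np.issubdtype(d, np.floating)     # False for an integer type

print(bit_width, is_integer, is_floating)       # 64 True False
```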
It seems you have hit the nail on the head about unicode and, actually, you have touched on a sore point.
Let's start from the latest numpy documentation.
The documentation on dtypes states that:
[(field_name, field_dtype, field_shape), ...]
obj should be a list of fields where each field is described by a tuple of length 2 or 3. (Equivalent to the descr item in the __array_interface__ attribute.)
The first element, field_name, is the field name (if this is '' then a standard field name, 'f#', is assigned). The field name may also be a 2-tuple of strings where the first string is either a “title” (which may be any string or unicode string) or meta-data for the field which can be any object, and the second string is the “name” which must be a valid Python identifier.
The second element, field_dtype, can be anything that can be interpreted as a data-type.
The optional third element field_shape contains the shape if this field represents an array of the data-type in the second element. Note that a 3-tuple with a third argument equal to 1 is equivalent to a 2-tuple.
This style does not accept align in the dtype constructor as it is assumed that all of the memory is accounted for by the array interface description.
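The list-of-tuples form quoted above can be sketched as follows. The 'ohlc' field and its shape are hypothetical, chosen only to show a 3-tuple.

```python
import numpy as np

dt = np.dtype([
    ('date', '<i8'),          # 2-tuple: (field_name, field_dtype)
    ('ohlc', '<f8', (4,)),    # 3-tuple: field_shape makes a 4-element sub-array
])

# Empty field names are replaced by the standard names 'f0', 'f1', ...
auto = np.dtype([('', '<i8'), ('', '<f8')])

print(dt.names, dt['ohlc'].shape, auto.names)
```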
So the doc doesn't really specify whether the field name can be unicode. What we can be sure of from the doc is that if we define a tuple as the field name, e.g. ((u'date', 'date'), '<i8'), then using unicode as the "title" (notice, still not for the name!) leads to no errors.
Even in this case, however, if you define ((u'date', u'date'), '<i8') you will get an error.
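A short sketch of the (title, name) form follows. The title 'Trade date' is made up for illustration. On Python 3 every str is unicode, so the Py2 error described here is not reproduced. The sketch only shows where the title and the name end up.

```python
import numpy as np

# The field name is given as a 2-tuple: (title, name)
dt = np.dtype([(('Trade date', 'date'), '<i8'), ('currency', 'S7')])

print(dt.names)            # only the names are listed, not the titles
print(dt.fields['date'])   # (dtype, offset, title)
```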
Now, you can use unicode names in Py2 by calling encode("ascii") on them, e.g. u'date'.encode("ascii"), and this should work.
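One way to apply this to a whole description is sketched below. with_native_names is a hypothetical helper, not part of numpy. It assumes plain ASCII string names (no (title, name) tuples) and uses str(), which on Py2 has the same effect as encode("ascii") for such names and on Py3 leaves them unchanged.

```python
import numpy as np

def with_native_names(descr):
    """Return a copy of descr whose field names are native str objects."""
    # Keep any remaining tuple items (format, optional shape) untouched.
    return [(str(field[0]),) + tuple(field[1:]) for field in descr]

descr = [(u'date', '<i8'), (u'currency', '|S7')]
dt = np.dtype(with_native_names(descr))
print(dt)
```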
One big point is that in Py2, Numpy does not allow specifying a dtype with unicode field names as a list of tuples, but does allow it using dictionaries.
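The dictionary form mentioned here looks roughly like this. Only a subset of the question's fields is used to keep the sketch short.

```python
import numpy as np

dt = np.dtype({
    'names': [u'date', u'open', u'currency'],
    'formats': ['<i8', '<f8', 'S7'],
})

print(dt.names, dt.itemsize)
```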
If I don't use unicode names in Py2, I can change the last field from |O to |S7; if you define the name as a unicode string, you have to use encode("ascii").
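Putting the pieces together, here is a hedged end-to-end sketch. The records array is a hypothetical stand-in for the data in the question, and the field-by-field copy is just one straightforward way to do the conversion, not necessarily what was used originally.

```python
import numpy as np

# Hypothetical stand-in for the records in the question.
records = np.array(
    [(20160101, 1.5, 'USD'), (20160102, 1.6, 'EUR')],
    dtype=[('date', '<i8'), ('close', '<f8'), ('currency', 'O')],
)

# Rebuild the description with native-str names and swap the last format.
dtype_new = [(str(name), fmt) for name, fmt in records.dtype.descr]
dtype_new[-1] = (dtype_new[-1][0], '|S7')

# Copy field by field into an array of the new dtype.
converted = np.empty(records.shape, dtype=np.dtype(dtype_new))
for name in records.dtype.names:
    converted[name] = records[name]

print(converted.dtype)
```

The sketch assumes the currency codes are ASCII and no longer than 7 bytes; longer values would be truncated.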
And the bugs involved...
To understand why you see this behaviour, it is useful to have a look at the bugs/issues reported in Numpy and Pandas and the related discussions.
Numpy
https://github.com/numpy/numpy/issues/2407
You can notice in the discussion (which I do not reproduce here) mainly a couple of things:
- a suggested workaround is to use encode("ascii") on the unicode string
- the 'whatever' string has different defaults (bytes/unicode) in Py2/3
- with the dictionary form {'names':[ alist], 'formats':[alist]...}, "the py2 case also allows unicode names"
Pandas
Also on the pandas side an issue has been reported which relates to the numpy issue: https://github.com/pandas-dev/pandas/pull/13462
It seems to have been fixed not that long ago.