Why does Pandas coerce my numpy float32 to float64 in this piece of code:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
>>> A = df.ix[:, 0:1].values
>>> df.ix[:, 0:1] = A
>>> df[0].dtype
dtype('float64')
The behavior seems so odd to me that I wonder if it is a bug. I am on Pandas version 0.17.1 (the current PyPI version), and I note that some coercion bugs have been addressed recently, see https://github.com/pydata/pandas/issues/11847 . I haven't tried this piece of code with an updated GitHub master.
Is it a bug or do I misunderstand some "feature" in Pandas? If it is a feature, then how do I get around it?
(The coercing problem relates to a question I recently asked about the performance of Pandas assignments: Assignment of Pandas DataFrame with float32 and float64 slow)
float32 is a 32-bit number; float64 uses 64 bits. That means float64 values take up twice as much memory, and operations on them may be slower on some machine architectures. However, float64 can represent numbers much more precisely than float32, and it covers a much larger range of values.
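The size, precision, and range differences can be checked directly with NumPy. This is only a minimal sketch; the variable names are chosen here for illustration.

```python
import numpy as np

# Storage size per element, in bytes.
size32 = np.dtype(np.float32).itemsize
size64 = np.dtype(np.float64).itemsize

# Approximate number of decimal digits each type can represent reliably.
digits32 = np.finfo(np.float32).precision
digits64 = np.finfo(np.float64).precision

# Largest finite value each type can hold.
max32 = float(np.finfo(np.float32).max)
max64 = float(np.finfo(np.float64).max)

print(size32, size64)
print(digits32, digits64)
print(max32 < max64)
```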
At least on Intel, float64 may be as fast as or faster than float32 when the math is done on the FPU at a wider precision, because float32 values then need to be converted. Memory bandwidth also comes into play, though.
Python's floating-point numbers are usually 64-bit floating-point numbers, nearly equivalent to np.
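A quick way to see the relationship on a typical CPython build (an assumption; the language itself only says the C double of the platform is used): NumPy's np.float64 is a subclass of Python's float, while a float32 value changes when it is widened.

```python
import sys
import numpy as np

# 53 significand bits corresponds to an IEEE 754 double (64-bit) float.
mantissa_bits = sys.float_info.mant_dig

# np.float64 is a subclass of the built-in float type.
is_subclass = issubclass(np.float64, float)

# A float64 holds exactly the same value as the Python float 0.1 ...
same_as_python = float(np.float64(0.1)) == 0.1

# ... while float32 stores a coarser approximation of 0.1.
float32_differs = float(np.float32(0.1)) != 0.1

print(mantissa_bits, is_subclass, same_as_python, float32_differs)
```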
Go has no plain float type; its floating-point types are float32 and float64. float32 stores floating-point values in 32 bits of data.
I think it is worth posting this as a GitHub issue. The behavior is certainly inconsistent.
The code takes a different branch based on whether the DataFrame is mixed-type or not (source).
In the mixed-type case the ndarray is converted to a Python list of float64 numbers and then converted back into a float64 ndarray, disregarding the DataFrame's dtype information (function maybe_convert_objects()).
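The effect of such a round trip can be reproduced with NumPy alone. This sketch does not call the pandas internals mentioned above; it only mimics the "array to Python list and back" step to show where the float32 information gets lost.

```python
import numpy as np

a32 = np.array([[1, 2], [3, 4]], dtype=np.float32)

# tolist() yields nested lists of plain Python floats, which are 64-bit.
as_list = a32.tolist()

# Rebuilding an array without stating a dtype lets NumPy infer float64.
rebuilt = np.array(as_list)

# Passing the original dtype explicitly preserves float32.
rebuilt32 = np.array(as_list, dtype=a32.dtype)

print(a32.dtype, rebuilt.dtype, rebuilt32.dtype)
```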
In the non-mixed-type case the DataFrame content is updated pretty much directly (source) and the DataFrame keeps its float32 dtypes.
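As for getting around it, one possible approach is to record the column dtypes before the assignment and cast back afterwards. The helper name assign_preserving_dtypes is made up for this sketch, and the frame is built column by column because newer pandas releases may reject dtype=np.float32 for data containing strings. It uses plain column assignment and astype() rather than .ix.

```python
import numpy as np
import pandas as pd

# Mixed-type frame: two float32 columns and one object column.
df = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: pd.Series(["a", "b"], dtype=object),
})

def assign_preserving_dtypes(frame, columns, values):
    """Hypothetical helper: assign values, then restore the original dtypes."""
    saved = frame[columns].dtypes.to_dict()  # remember dtypes before assigning
    out = frame.copy()
    out[columns] = values                    # may or may not upcast to float64
    return out.astype(saved)                 # cast back to the saved dtypes

A = df[[0, 1]].to_numpy()
result = assign_preserving_dtypes(df, [0, 1], A * 2)
print(result.dtypes)
```

The extra astype() costs a copy when an upcast did happen, so this is a workaround rather than a fix for the underlying inconsistency.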
Not an answer, but my recreation of the problem:
In [2]: df = pd.DataFrame([[1, 2, 'a'], [3, 4, 'b']], dtype=np.float32)
In [3]: df.dtypes
Out[3]:
0 float32
1 float32
2 object
dtype: object
In [4]: A=df.ix[:,:1].values
In [5]: A
Out[5]:
array([[ 1., 2.],
[ 3., 4.]], dtype=float32)
In [6]: df.ix[:,:1] = A
In [7]: df.dtypes
Out[7]:
0 float64
1 float64
2 object
dtype: object
In [8]: pd.__version__
Out[8]: '0.15.0'
I'm not as familiar with pandas as numpy, but I'm puzzled as to why ix[:,:1] gives me a 2 column result. In numpy that sort of indexing gives just 1 column.
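The difference comes from label-based versus positional slicing: with integer column labels, the slice end is treated as a label and is included. The sketch below uses .loc and .iloc, since .ix is no longer available in later pandas releases; the small frame is made up for illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)
df = pd.DataFrame(arr)  # column labels are 0, 1, 2

# NumPy slices by position and excludes the end: only column 0.
numpy_cols = arr[:, :1].shape[1]

# .loc slices by label and includes the end label: columns 0 and 1.
loc_cols = df.loc[:, :1].shape[1]

# .iloc slices by position, like NumPy: only column 0.
iloc_cols = df.iloc[:, :1].shape[1]

print(numpy_cols, loc_cols, iloc_cols)
```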
If I assign a single column, the dtype does not change:
In [47]: df.ix[:,[0]]=A[:,0]
In [48]: df.dtypes
Out[48]:
0 float32
1 float32
2 object
The same actions without mixed datatypes do not change the dtypes:
In [100]: df1 = pd.DataFrame([[1, 2, 1.23], [3, 4, 3.32]], dtype=np.float32)
In [101]: A1=df1.ix[:,:1].values
In [102]: df1.ix[:,:1]=A1
In [103]: df1.dtypes
Out[103]:
0 float32
1 float32
2 float32
dtype: object
The key must be that with mixed values, the dataframe is, in one sense or other, a dtype=object array, whether that's true of its internal data storage, or just its numpy interface.
In [104]: df1.as_matrix()
Out[104]:
array([[ 1. , 2. , 1.23000002],
[ 3. , 4. , 3.31999993]], dtype=float32)
In [105]: df.as_matrix()
Out[105]:
array([[1.0, 2.0, 'a'],
[3.0, 4.0, 'b']], dtype=object)
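The same comparison can be sketched with to_numpy(), which later pandas releases provide in place of as_matrix(). The mixed frame is built column by column here, an assumption made so the example does not depend on how the constructor treats dtype=np.float32 for string data.

```python
import numpy as np
import pandas as pd

homogeneous = pd.DataFrame([[1, 2, 1.23], [3, 4, 3.32]], dtype=np.float32)
mixed = pd.DataFrame({
    0: np.array([1, 3], dtype=np.float32),
    1: np.array([2, 4], dtype=np.float32),
    2: pd.Series(["a", "b"], dtype=object),
})

# All columns share float32, so the exported array keeps float32.
homogeneous_dtype = homogeneous.to_numpy().dtype

# float32 and object columns have no common numeric dtype: result is object.
mixed_dtype = mixed.to_numpy().dtype

# Selecting only the numeric columns first avoids the object array.
numeric_part_dtype = mixed[[0, 1]].to_numpy().dtype

print(homogeneous_dtype, mixed_dtype, numeric_part_dtype)
```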