I want to use the common pattern of applying a function to every column in a Pandas DataFrame, where the function behaves conditionally on the column's data type.
Sounds simple enough, but I ran into weird behaviour when testing for the data type, and I cannot find an explanation for it anywhere in the docs or by googling.
Consider this reprex:
import pandas as pd

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))
Checking the dtypes individually, they are dtype('int64'), dtype('float64'), dtype('O'), and dtype('bool'). But if I use the apply function, every column passed to the function has dtype: object.
def dtype_fn(the_col):
    print(the_col)
    return the_col.dtype

toydf.apply(dtype_fn)
0 1
1 2
2 3
Name: A, dtype: object
0 1.1
1 1.2
2 1.3
Name: B, dtype: object
0 1
1 2
2 3
Name: C, dtype: object
0 True
1 True
2 False
Name: D, dtype: object
Out[167]:
A object
B object
C object
D object
dtype: object
Why is this? What am I doing wrong? Why don't the columns retain their original data types?
Here's an approach that works and produces my desired output (though I don't like it, for encapsulation reasons):

def dtype_fn2(col_name):
    return toydf[col_name].dtype

[dtype_fn2(col) for col in toydf.columns]
Out[173]: [dtype('int64'), dtype('float64'), dtype('O'), dtype('bool')]
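As an aside, if the goal is simply to visit each column with its original dtype intact, iterating with DataFrame.items() sidesteps the problem entirely: each column is yielded as its own Series, so nothing is upcast to a common type. A minimal sketch (not from the original post):

# Each iteration yields (column name, column as a Series), and the
# column keeps its own dtype because no combined array is ever built.
for name, col in toydf.items():
    print(name, col.dtype)
# A int64
# B float64
# C object
# D bool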
This behaviour is by design. When apply is given several columns, Pandas uses the type that is highest up in the type hierarchy for all of the dtypes involved.
Consider applying the function to only "A" (here dtype_fn prints the_col.dtype rather than the full column, to keep the output short):

toydf[['A']].apply(dtype_fn)
int64
A    int64
dtype: object
And similarly, with only "A" and "B" (note that "A" is upcast to the common type float64):

toydf[['A', 'B']].apply(dtype_fn)
float64
float64
A    float64
B    float64
dtype: object
Since your original DataFrame mixes several types, including strings, the common type for them all is object.
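You can see the same upcasting without apply at all: DataFrame.values builds a single NumPy array for the selected columns, so its dtype is the common type Pandas settled on. A small check, assuming the toydf from the question:

# The dtype of the combined array is the common type of the columns.
print(toydf[['A']].values.dtype)       # int64
print(toydf[['A', 'B']].values.dtype)  # float64
print(toydf.values.dtype)              # object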
Now this explains the behaviour, but I still need to address the fix. Pandas offers a useful method, Series.infer_objects, which infers the dtype and performs a "soft conversion". If you really need the type inside the function, you can perform this soft cast before calling dtype. This produces the expected result:
def dtype_fn(the_col):
    the_col = the_col.infer_objects()
    print(the_col.dtype)
    return the_col.dtype

toydf.apply(dtype_fn)
int64
float64
object
bool
A int64
B float64
C object
D bool
dtype: object
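With that fix in hand, the original goal of applying a function that branches on the column's type falls out naturally. A minimal sketch, assuming the toydf above (the function name and the doubling rule are illustrative only, not from the original post):

import pandas.api.types as ptypes

def scale_numeric(the_col):
    # Soft-convert first, so the dtype check sees the column's real type
    # rather than the object dtype that apply hands over.
    the_col = the_col.infer_objects()
    if ptypes.is_numeric_dtype(the_col) and not ptypes.is_bool_dtype(the_col):
        return the_col * 2   # numeric columns: double the values
    return the_col           # strings and booleans: pass through unchanged

toydf.apply(scale_numeric)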
The actual input to your dtype_fn is a Pandas Series object. You can access the underlying type by modifying your method slightly:

def dtype_fn(the_col):
    print(the_col.values.dtype)
    return the_col.values.dtype
For more info about why this is the case, have a look at this answer, which notes:

This is not an error but is due to the numpy dtype representation: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html