
Applying function to columns of a Pandas DataFrame, conditional on data type

Tags:

python

pandas

I want to use the common pattern to apply a function to every column in a Pandas DataFrame, but the function should work conditional on the column data type.

Sounds simple enough, but I found some odd behavior when testing for the data type, and I cannot find the reason for it in the docs or by searching online.

Consider this reprex:

import pandas as pd

toydf = pd.DataFrame(dict(
    A = [1, 2, 3],
    B = [1.1, 1.2, 1.3],
    C = ['1', '2', '3'],
    D = [True, True, False]
))

Checked individually, the column dtypes are dtype('int64'), dtype('float64'), dtype('O') and dtype('bool').

But if I use apply, every column passed to the function has dtype: object.

def dtype_fn(the_col):
    print(the_col)
    return(the_col.dtype)

toydf.apply(dtype_fn)

toydf.apply(dtype_fn)
0    1
1    2
2    3
Name: A, dtype: object
0    1.1
1    1.2
2    1.3
Name: B, dtype: object
0    1
1    2
2    3
Name: C, dtype: object
0     True
1     True
2    False
Name: D, dtype: object
Out[167]: 
A    object
B    object
C    object
D    object
dtype: object

Why is this? What am I doing wrong? Why don't the columns retain their original data types?

Here's an approach that works and produces my desired output (but for encapsulation reasons, I don't like it):

def dtype_fn2(col_name):
    return(toydf[col_name].dtype)

[dtype_fn2(col) for col in toydf.columns]

Out[173]: [dtype('int64'), dtype('float64'), dtype('O'), dtype('bool')]
Hernando Casas asked Mar 15 '19 10:03



2 Answers

This behaviour is by design. When the columns passed in have different dtypes, pandas "applies" the one that is highest up in the type hierarchy, i.e. the most general dtype that can hold all of them.

Consider applying the function to only "A",

toydf[['A']].apply(dtype_fn)
int64

A    int64
dtype: object

And similarly, with only "A" and "B",

toydf[['A', 'B']].apply(dtype_fn)
float64
float64

A    float64
B    float64
dtype: object

Since your original DataFrame has multiple types, including strings, the common type for all of them is object.
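The common-dtype rule can be seen without apply at all. The following is a minimal sketch that uses DataFrame.to_numpy() as a stand-in: it forces the selected columns into a single NumPy array, which must have one dtype. It only illustrates the rule described above; the variable names are chosen here for illustration.

```python
import pandas as pd

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))

# A single NumPy array has exactly one dtype, so pandas must pick
# one that can represent every selected column.
ints_only = toydf[['A']].to_numpy().dtype            # integers stay integers
ints_and_floats = toydf[['A', 'B']].to_numpy().dtype  # integers are upcast to float
whole_frame = toydf.to_numpy().dtype                  # strings force the generic object dtype

print(ints_only, ints_and_floats, whole_frame)
```

The per-column dtypes reported by toydf.dtypes are unaffected; only the combined array is upcast.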


This explains the behaviour, but I still need to address the fix. Pandas offers a useful method for this: Series.infer_objects, which infers the dtype and performs a "soft conversion".

If you really need the type in the function, you can perform a soft cast before calling dtype. This produces the expected result:

def dtype_fn(the_col):
    # Soft-convert an object column back to a more specific dtype where possible
    the_col = the_col.infer_objects()
    print(the_col.dtype)

    return the_col.dtype

toydf.apply(dtype_fn)
int64
float64
object
bool

A      int64
B    float64
C     object
D       bool
dtype: object
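Building on the soft cast, the function can then branch on the column's dtype, which is what the question is ultimately after. This is a sketch under assumptions: label_by_dtype and its return labels are hypothetical names, and the checks use is_bool_dtype and is_numeric_dtype from pandas.api.types. The bool check comes first because pandas treats bool columns as numeric too.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))

def label_by_dtype(the_col):
    # Hypothetical helper: soft-cast first, so the branch still works
    # if the column arrives with object dtype.
    the_col = the_col.infer_objects()
    if is_bool_dtype(the_col):      # must precede the numeric check
        return "boolean"
    if is_numeric_dtype(the_col):
        return "numeric"
    return "other"

labels = toydf.apply(label_by_dtype)
print(labels)
```

Each branch could just as well return a transformed column instead of a label; the dtype test itself stays the same.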
cs95 answered Oct 24 '22 23:10


The actual input to your dtype_fn is a Pandas Series object. You can access the underlying type by modifying your method slightly.

def dtype_fn(the_col):
    print(the_col.values.dtype)
    return(the_col.values.dtype)

For more information on why this is the case, have a look at this answer, which says:

This is not an error but is due to the numpy dtype representation: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html.
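If the aim is simply to hand each column to a function as a Series with its own dtype, one alternative sketch (not taken from the answers here) is to iterate over DataFrame.items(), which yields (column name, Series) pairs. The function receives the column itself rather than a column name, so it does not need to reach for the global DataFrame. The name dtypes_by_col is an illustrative assumption.

```python
import pandas as pd

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))

def dtype_fn(the_col):
    # the_col is the column Series taken directly from the DataFrame
    return the_col.dtype

# items() yields (column name, Series) pairs, one per column
dtypes_by_col = {name: dtype_fn(col) for name, col in toydf.items()}
print(dtypes_by_col)
```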

Adam Fjeldsted answered Oct 24 '22 23:10