
Applying function to columns of a Pandas DataFrame, conditional on data type

Tags: python, pandas

I want to use the common pattern to apply a function to every column in a Pandas DataFrame, but the function should work conditional on the column data type.

Sounds simple enough. But I found a weird behavior when testing for the data type, and I cannot find the reason for it in the docs or by googling.

Consider this reprex:

import pandas as pd

toydf = pd.DataFrame(dict(
    A = [1, 2, 3],
    B = [1.1, 1.2, 1.3],
    C = ['1', '2', '3'],
    D = [True, True, False]
))

Checking the dtypes individually gives dtype('int64'), dtype('float64'), dtype('O'), dtype('bool').

But when I use the apply function, every column passed to the function has dtype: object.

def dtype_fn(the_col):
    print(the_col)
    return the_col.dtype

toydf.apply(dtype_fn)
0    1
1    2
2    3
Name: A, dtype: object
0    1.1
1    1.2
2    1.3
Name: B, dtype: object
0    1
1    2
2    3
Name: C, dtype: object
0     True
1     True
2    False
Name: D, dtype: object
Out[167]: 
A    object
B    object
C    object
D    object
dtype: object

Why is this? What am I doing wrong? Why do the columns not retain their original data types?

Here's an approach that works and produces my desired output (but I don't like it, because the function depends on the global DataFrame, which breaks encapsulation):

def dtype_fn2(col_name):
    return toydf[col_name].dtype

[dtype_fn2(col) for col in toydf.columns]

Out[173]: [dtype('int64'), dtype('float64'), dtype('O'), dtype('bool')]
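
A variant of the workaround above that keeps the function self-contained could look like the sketch below. It assumes that iterating with `DataFrame.items()` is an acceptable substitute for `apply`; each column then arrives as its own Series, so its dtype can be read directly. The helper name `dtype_of` is hypothetical.

```python
import pandas as pd

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))

def dtype_of(col):
    # Hypothetical helper: it receives the column itself rather than its
    # name, so it does not need to look up the global DataFrame.
    return col.dtype

# items() yields (column name, Series) pairs, one per column.
dtypes = {name: dtype_of(col) for name, col in toydf.items()}
print(dtypes)
```

Because the function only sees the Series it is given, it can be reused with any DataFrame.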


asked Mar 15 '19 10:03

Hernando Casas

2 Answers

This comment is correct. This behaviour is by design. When the columns have different dtypes, Pandas "applies" the single type that is highest up in the type hierarchy among all the dtypes given.

Consider applying the function to only "A",

toydf[['A']].apply(dtype_fn)
int64

A    int64
dtype: object

And similarly, with only "A" and "B",

toydf[['A', 'B']].apply(dtype_fn)
float64
float64

A    float64
B    float64
dtype: object

Since your original DataFrame has multiple types, including strings, the common type for them all is object.
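
The idea of a common type can be illustrated with `DataFrame.to_numpy()`, which has to fit every selected column into a single NumPy array. This is only a sketch of the upcasting rule; whether `apply` goes through the same conversion internally is an assumption here and may depend on the pandas version.

```python
import pandas as pd

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))

# Each selection must be stored in one array with one dtype.
only_ints = toydf[['A']].to_numpy().dtype             # integers only
ints_and_floats = toydf[['A', 'B']].to_numpy().dtype  # integers and floats
everything = toydf.to_numpy().dtype                   # numbers, strings, booleans

print(only_ints, ints_and_floats, everything)
```

The wider the mix of column types, the more general the shared dtype has to be, and object is the most general one.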

That explains the behaviour; now for the fix. Pandas offers a useful method, Series.infer_objects, which infers the dtype and performs a "soft conversion".

If you really need the type inside the function, you can perform a soft cast before calling dtype. This produces the expected result:

def dtype_fn(the_col):
    the_col = the_col.infer_objects()
    print(the_col.dtype)

    return the_col.dtype

toydf.apply(dtype_fn)
int64
float64
object
bool

A      int64
B    float64
C     object
D       bool
dtype: object
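
Building on the soft cast above, a function that behaves differently per dtype might be sketched as follows. The function name `describe_by_dtype` and the three summaries it returns are hypothetical, chosen only to show the branching; the checks come from `pandas.api.types`.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype

toydf = pd.DataFrame(dict(
    A=[1, 2, 3],
    B=[1.1, 1.2, 1.3],
    C=['1', '2', '3'],
    D=[True, True, False],
))

def describe_by_dtype(the_col):
    # Soft-cast first, in case the column arrives with dtype object.
    the_col = the_col.infer_objects()
    # Test for bool before numeric: is_numeric_dtype is also True for bool.
    if is_bool_dtype(the_col):
        return "share true: {:.2f}".format(the_col.mean())
    if is_numeric_dtype(the_col):
        return "mean: {:.2f}".format(the_col.mean())
    # Anything else (here, the string column) gets a generic summary.
    return "unique values: {}".format(the_col.nunique())

result = toydf.apply(describe_by_dtype)
print(result)
```

If the column already has its proper dtype, infer_objects leaves it unchanged, so the same function should work whether or not the column was upcast to object first.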


answered Oct 24 '22 23:10

cs95

The actual input to your dtype_fn is a Pandas Series object. You can access the underlying type by modifying your method slightly.

def dtype_fn(the_col):
    print(the_col.values.dtype)
    return the_col.values.dtype

For more information about why this is the case, have a look at this answer, which says:

This is not an error but is due to the numpy dtype representation: https://docs.scipy.org/doc/numpy/reference/arrays.scalars.html.

answered Oct 24 '22 23:10

Adam Fjeldsted
