How do I preserve datatype when using apply row-wise in pandas dataframe?

Tags: python, pandas

I'm running into a weird problem where using the apply function row-wise on a dataframe doesn't preserve the datatypes of the values in the dataframe. Is there a way to apply a function row-wise on a dataframe that preserves the original datatypes?

The code below demonstrates this problem. Without the int(...) conversion within the format function below, there would be an error because the int from the dataframe was converted to a float when passed into func.

import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})
print(df)
print(df.dtypes)

def func(int_and_float):
    int_val, float_val = int_and_float
    print('int_val type:', type(int_val))
    print('float_val type:', type(float_val))
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)

df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)
print(df)

Here is the output from running the above code:

   float_col  int_col
0       1.23        1
1       4.56        2
float_col    float64
int_col        int64
dtype: object
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
   float_col  int_col           string_col
0       1.23        1  int-001_float-1.230
1       4.56        2  int-002_float-4.560

Notice that even though the int_col column of df has dtype int64, when values from that column get passed into function func, they suddenly have dtype numpy.float64, and I have to use int(...) in the last line of the function to convert back, otherwise that line would give an error.

I can deal with this problem the way I have here if necessary, but I'd really like to understand why I'm seeing this unexpected behavior.

asked Nov 06 '17 18:11

Ben Lindsay

2 Answers

What is happening is that when you call apply(axis=1), each row is passed to your function as a pandas Series, and a Series has a single dtype. Since your row contains both integers and floats, the entire Series gets cast to float.

import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

def func(int_and_float):
    int_val, float_val = int_and_float
    print('\n')
    print('Prints input series')
    print(int_and_float)
    print('\n')
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)

df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)

Output:

Prints input series
int_col      1.00
float_col    1.23
Name: 0, dtype: float64




Prints input series
int_col      2.00
float_col    4.56
Name: 1, dtype: float64
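
As an aside that goes beyond the explanation above: if the goal is only to avoid the cast, one option is to skip building a row Series entirely and iterate over the columns in parallel. The sketch below assumes the same df as in the question; format_pair is a hypothetical helper name, not something from the original code.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

def format_pair(int_val, float_val):
    # Hypothetical helper: no int(...) cast is needed here because
    # int_val never passes through a float64 row Series.
    return 'int-{:03d}_float-{:5.3f}'.format(int_val, float_val)

# zip pulls values column by column, so each value keeps its column's type.
df['string_col'] = [
    format_pair(i, f) for i, f in zip(df['int_col'], df['float_col'])
]
print(df)
```

This trades the convenience of apply for an explicit loop, which is usually acceptable when the function only needs a few known columns.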

answered Sep 29 '22 11:09

Scott Boston

Your ints are getting upcast to floats. Pandas (and NumPy) will try to make a Series (or ndarray) into a single data type if possible. As far as I know, the exact rules for upcasting are not documented, but you can see how different types will be upcast by using numpy.find_common_type (deprecated in NumPy 1.25 and removed in NumPy 2.0; numpy.result_type and numpy.promote_types answer the same question).
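
A minimal sketch of checking the promotion result for the dtypes in this question. It uses numpy.result_type rather than find_common_type, which is a substitution on my part so that the snippet also runs on NumPy 2.x.

```python
import numpy as np

# Ask NumPy which single dtype can hold both an int64 and a float64.
common = np.result_type(np.int64, np.float64)
print(common)

# Combining a numeric dtype with the object dtype falls back to object.
mixed = np.result_type(np.int64, np.dtype('O'))
print(mixed)
```

The first result is the dtype that a row Series ends up with when int_col and float_col are combined.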

You can trick Pandas and NumPy into keeping the original data types by casting the DataFrame as type "Object" before calling apply, like this:

df['string_col'] = df[['int_col', 'float_col']].astype('O').apply(func, axis=1)

Let's break down what is happening here. First, what happens to df after we do .astype('O')?

as_object = df[['int_col', 'float_col']].astype('O')
print(as_object.dtypes)

Gives:

int_col      object
float_col    object
dtype: object

Okay so now both columns have the same dtype, which is object. We know from before that apply() (or anything else that extracts one row from a DataFrame) will try to convert both columns to the same dtype, but it will see that they are already the same, so there is nothing to do.

However, we are still able to get the original ints and floats back because dtype('O') behaves as a sort of container type that can hold any Python object. Typically it is used when a Series contains types that aren't meant to be mixed (like strings and ints) or any Python object that NumPy doesn't understand.
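
To see the difference directly, here is a small sketch that extracts the first row with and without the object cast. It assumes the same df as in the question; plain_row and object_row are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

# Extracting a row from mixed numeric columns upcasts everything to one dtype.
plain_row = df[['int_col', 'float_col']].iloc[0]
print(plain_row.dtype)

# With object columns, the row is an object Series and each element
# keeps its own type instead of being converted.
object_row = df[['int_col', 'float_col']].astype('O').iloc[0]
print(object_row.dtype)
print(type(object_row['int_col']), type(object_row['float_col']))
```

Note that object columns give up NumPy's vectorized numeric operations, so this cast is best applied only to the temporary frame passed to apply, as in the one-liner above.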

answered Sep 29 '22 10:09

Michael
