How do I preserve datatype when using apply row-wise in pandas dataframe?

Tags: python, pandas

I'm running into a weird problem where using the apply function row-wise on a dataframe doesn't preserve the datatypes of the values in the dataframe. Is there a way to apply a function row-wise on a dataframe that preserves the original datatypes?

The code below demonstrates this problem. Without the int(...) conversion within the format function below, there would be an error because the int from the dataframe was converted to a float when passed into func.

import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})
print(df)
print(df.dtypes)

def func(int_and_float):
    int_val, float_val = int_and_float
    print('int_val type:', type(int_val))
    print('float_val type:', type(float_val))
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)

df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)
print(df)

Here is the output from running the above code:

   float_col  int_col
0       1.23        1
1       4.56        2
float_col    float64
int_col        int64
dtype: object
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
   float_col  int_col           string_col
0       1.23        1  int-001_float-1.230
1       4.56        2  int-002_float-4.560

Notice that even though the int_col column of df has dtype int64, values from that column arrive in func as numpy.float64. I have to use int(...) in the last line of the function to convert them back; otherwise the {:03d} format would raise an error.

I can deal with this problem the way I have here if necessary, but I'd really like to understand why I'm seeing this unexpected behavior.

Ben Lindsay asked Nov 06 '17 18:11



2 Answers

When you call apply(axis=1), each row is passed to your function as a pandas Series, and a Series has exactly one dtype. Since your row contains both integers and floats, the entire Series is cast to float64.

import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

def func(int_and_float):
    int_val, float_val = int_and_float
    print('\n')
    print('Prints input series')
    print(int_and_float)
    print('\n')
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)

df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)

Output:

Prints input series
int_col      1.00
float_col    1.23
Name: 0, dtype: float64




Prints input series
int_col      2.00
float_col    4.56
Name: 1, dtype: float64
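
If the goal is simply to keep each column's own type, one option is to avoid building a row Series at all. The sketch below is only an illustration, not part of the original answer: format_pair is a hypothetical helper name, and it assumes the function needs just these two columns. It first confirms that a single extracted row is float64, then iterates the columns in parallel with zip so no int() conversion is needed.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

# A single row pulled out of a mixed int/float frame is one float64 Series.
row = df[['int_col', 'float_col']].iloc[0]
print(row.dtype)  # float64

# Hypothetical helper: takes the two values separately instead of a row Series.
def format_pair(int_val, float_val):
    return 'int-{:03d}_float-{:5.3f}'.format(int_val, float_val)

# Iterating the columns side by side keeps each column's own type,
# so the integer values are never upcast to float.
df['string_col'] = [
    format_pair(i, f) for i, f in zip(df['int_col'], df['float_col'])
]
print(df)
```

This trades the convenience of apply for explicit column access, which is usually acceptable when only a few columns are involved.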
Scott Boston answered Sep 29 '22 11:09


Your ints are being upcast to floats. Pandas (and NumPy) will try to give a Series (or ndarray) a single data type if possible. As far as I know, the exact rules for upcasting are not documented, but you can see how different types will be upcast by using numpy.find_common_type (deprecated in NumPy 1.25 and removed in 2.0; numpy.result_type serves the same purpose on current versions).
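
As a rough way to inspect the upcasting, here is a small sketch that uses numpy.result_type, which reports the dtype NumPy would pick when combining the given types. It assumes a NumPy version where that function is available, and it shows NumPy's promotion only; pandas may apply its own rules in some cases.

```python
import numpy as np

# Mixing int64 and float64 promotes to float64, which is what happens
# to the row in the question.
int_float = np.result_type(np.int64, np.float64)
print(int_float)  # float64

# Mixing anything with object stays object, so nothing is converted.
int_object = np.result_type(np.int64, np.object_)
print(int_object)  # object
```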

You can trick Pandas and NumPy into keeping the original data types by casting the DataFrame to the object dtype ('O') before calling apply, like this:

df['string_col'] = df[['int_col', 'float_col']].astype('O').apply(func, axis=1)

Let's break down what is happening here. First, what happens to df after we do .astype('O')?

as_object = df[['int_col', 'float_col']].astype('O')
print(as_object.dtypes)

Gives:

int_col      object
float_col    object
dtype: object

Okay, so now both columns have the same dtype, which is object. We know from before that apply() (or anything else that extracts one row from a DataFrame) will try to convert the values from both columns to a common dtype, but here the dtypes are already identical, so there is nothing to convert.

However, we still get the original ints and floats back, because dtype('O') acts as a container type that can hold any Python object, so each value keeps its own type. It is typically used when a Series contains types that aren't meant to be mixed (like strings and ints), or any Python object that NumPy doesn't understand.
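
To check this, the sketch below reruns the question's example with the object cast and without the int(...) conversion. The seen_types list is a hypothetical addition used only to record what the function receives; the exact type names may vary between pandas versions, so they are printed rather than assumed.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

seen_types = []  # hypothetical helper list: records the types func receives

def func(int_and_float):
    int_val, float_val = int_and_float
    seen_types.append((type(int_val).__name__, type(float_val).__name__))
    # No int(...) conversion: the integer was never upcast to float.
    return 'int-{:03d}_float-{:5.3f}'.format(int_val, float_val)

# Casting to object first means each row Series is already object dtype.
df['string_col'] = df[['int_col', 'float_col']].astype('O').apply(func, axis=1)

print(seen_types)
print(df['string_col'].tolist())
```

Note that object columns give up NumPy's vectorized speed, so this cast is best limited to the columns passed to apply rather than applied to the whole DataFrame.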

Michael answered Sep 29 '22 10:09