I'm running into a weird problem where using the apply function row-wise on a dataframe doesn't preserve the datatypes of the values in the dataframe. Is there a way to apply a function row-wise on a dataframe that preserves the original datatypes?
The code below demonstrates this problem. Without the int(...) conversion within the format function below, there would be an error because the int from the dataframe was converted to a float when passed into func.
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})
print(df)
print(df.dtypes)

def func(int_and_float):
    # Each row arrives as a Series; unpack its two values.
    int_val, float_val = int_and_float
    print('int_val type:', type(int_val))
    print('float_val type:', type(float_val))
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)

df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)
print(df)
Here is the output from running the above code:
   float_col  int_col
0       1.23        1
1       4.56        2
float_col    float64
int_col        int64
dtype: object
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
int_val type: <class 'numpy.float64'>
float_val type: <class 'numpy.float64'>
   float_col  int_col           string_col
0       1.23        1  int-001_float-1.230
1       4.56        2  int-002_float-4.560
Notice that even though the int_col column of df has dtype int64, when values from that column get passed into function func, they suddenly have dtype numpy.float64, and I have to use int(...) in the last line of the function to convert back, otherwise that line would give an error.
I can deal with this problem the way I have here if necessary, but I'd really like to understand why I'm seeing this unexpected behavior.
To apply a function to every row, pass axis=1 to apply(). The function then receives each row in turn, so you can use the row's values to build a new column, update the row, and so on. Note that the default is axis=0, which applies the function to each column instead.
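As a small illustration of the two axis settings, here is a minimal sketch on a made-up frame (the column names a and b are hypothetical, chosen only for this example):

```python
import pandas as pd

# Hypothetical two-column frame, both columns int64.
df_axes = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=0 (the default): the function receives each column as a Series.
col_sums = df_axes.apply(lambda col: col.sum(), axis=0)

# axis=1: the function receives each row as a Series.
row_sums = df_axes.apply(lambda row: row.sum(), axis=1)

print(col_sums)
print(row_sums)
```

Here col_sums is indexed by column name and row_sums by row label.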
What is happening is that when you call apply(axis=1), each input row gets passed to your function as a pandas Series, and a Series has a single dtype. Since your row contains both integers and floats, the entire Series gets cast to float.
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

def func(int_and_float):
    int_val, float_val = int_and_float
    print('\n')
    print('Prints input series')
    # Show the row exactly as apply() hands it over, including its dtype.
    print(int_and_float)
    print('\n')
    return 'int-{:03d}_float-{:5.3f}'.format(int(int_val), float_val)

df['string_col'] = df[['int_col', 'float_col']].apply(func, axis=1)
Output:
Prints input series
int_col      1.00
float_col    1.23
Name: 0, dtype: float64

Prints input series
int_col      2.00
float_col    4.56
Name: 1, dtype: float64
Your ints are getting upcast to floats. Pandas (and NumPy) will try to make a Series (or ndarray) into a single data type if possible. As far as I know, the exact rules for upcasting are not documented, but you can see how different types will be upcast by using numpy.find_common_type (deprecated since NumPy 1.25 and removed in 2.0, where numpy.result_type and numpy.promote_types cover the same use).
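To make the upcasting concrete, here is a minimal sketch that asks NumPy which dtype it would choose for a mix of int64 and float64. It uses numpy.result_type and numpy.promote_types, on the assumption that you are on a recent NumPy release; the variable names are only for illustration.

```python
import numpy as np

# Common dtype for a mix of int64 and float64 values.
common = np.result_type(np.int64, np.float64)
print(common)

# promote_types answers the same question for exactly two dtypes.
pairwise = np.promote_types(np.int64, np.float64)
print(pairwise)

# An array built from an int and a float ends up with that common dtype.
mixed = np.array([1, 1.23])
print(mixed.dtype)
```

All three report float64, which matches the dtype of the row Series printed above.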
You can trick Pandas and NumPy into keeping the original data types by casting the DataFrame as type "Object" before calling apply, like this:
df['string_col'] = df[['int_col', 'float_col']].astype('O').apply(func, axis=1)
Let's break down what is happening here. First, what happens to df after we do .astype('O')?
as_object = df[['int_col', 'float_col']].astype('O')
print(as_object.dtypes)
Gives:
int_col      object
float_col    object
dtype: object
Okay, so now both columns have the same dtype, which is object. We know from before that apply() (or anything else that extracts one row from a DataFrame) will try to convert both columns to the same dtype, but it will see that they are already the same, so there is nothing to do.
However, we are still able to get the original ints and floats because dtype('O') behaves as a kind of container type that can hold any Python object. Typically it is used when a Series contains types that aren't meant to be mixed (like strings and ints) or any Python object that NumPy doesn't understand.
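Putting this together, here is a rough sketch of two ways to format each row without the int(...) workaround. The first is the object-dtype approach described above. The second swaps in a different technique, not from the answer above: it skips apply() and zips the columns directly, so no per-row Series is ever built. The helper name fmt is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'int_col': [1, 2], 'float_col': [1.23, 4.56]})

def fmt(int_val, float_val):
    # No int(...) conversion: this relies on int_val really being an integer.
    return 'int-{:03d}_float-{:5.3f}'.format(int_val, float_val)

# Option 1: cast to object first so each row Series keeps the original values.
via_object = (
    df[['int_col', 'float_col']]
    .astype(object)
    .apply(lambda row: fmt(row['int_col'], row['float_col']), axis=1)
)

# Option 2: iterate over the columns in parallel; each column keeps its own dtype.
via_zip = [fmt(i, f) for i, f in zip(df['int_col'], df['float_col'])]

df['string_col'] = via_zip
print(df)
```

Both options should produce the same strings. The zip version is usually the lighter of the two, since it avoids constructing a Series for every row, but it only fits when the function needs a known, small set of columns.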