I'm working with individual rows of pandas data frames, but I'm stumbling over coercion issues while indexing and inserting rows. Pandas seems to always want to coerce from a mixed int/float to all-float types, and I can't see any obvious controls on this behaviour.
For example, here is a simple data frame with a as int and b as float:
import pandas as pd
pd.__version__ # '0.25.2'
df = pd.DataFrame({'a': [1], 'b': [2.2]})
print(df)
# a b
# 0 1 2.2
print(df.dtypes)
# a int64
# b float64
# dtype: object
Here is a coercion issue while indexing one row:
print(df.loc[0])
# a 1.0
# b 2.2
# Name: 0, dtype: float64
print(dict(df.loc[0]))
# {'a': 1.0, 'b': 2.2}
And here is a coercion issue while inserting one row:
df.loc[1] = {'a': 5, 'b': 4.4}
print(df)
# a b
# 0 1.0 2.2
# 1 5.0 4.4
print(df.dtypes)
# a float64
# b float64
# dtype: object
In both instances, I want the a column to remain as an integer type, rather than being coerced to a float type.
After some digging, here are some terribly ugly workarounds. (A better answer will be accepted.)
A quirk found here is that a non-numeric column stops the coercion, so here is how to index one row to a dict:
dict(df.assign(_='').loc[0].drop('_', axis=0))
# {'a': 1, 'b': 2.2}
And inserting a row can be done by creating a new data frame with one row:
df = df.append(pd.DataFrame({'a': 5, 'b': 4.4}, index=[1]))
print(df)
# a b
# 0 1 2.2
# 1 5 4.4
Neither of these tricks is optimised for large data frames, so I would greatly appreciate a better answer!
Whenever you get data from a data frame or append data to one and need to keep the data types unchanged, avoid converting to other internal structures that are not aware of the data types needed.
When you do df.loc[0], the row is converted to a pd.Series:
>>> type(df.loc[0])
<class 'pandas.core.series.Series'>
A Series can only have a single dtype, so the int is coerced to float.
Instead, keep the structure as a pd.DataFrame:
>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>
Select the row needed as a frame and then convert it to a dict:
>>> df.loc[[0]].to_dict(orient='records')
[{'a': 1, 'b': 2.2}]
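As a rough sketch, the selection above could be wrapped in a small helper. The name row_as_dict is hypothetical and not part of pandas; it only relies on df.loc with a list of labels and to_dict(orient='records').

```python
import pandas as pd

def row_as_dict(frame, label):
    """Return the row at `label` as a dict, keeping per-column types.

    Hypothetical helper for illustration, not a pandas function.
    """
    # Double brackets keep the selection a DataFrame (one dtype per column)
    # rather than a Series (one dtype for the whole row).
    records = frame.loc[[label]].to_dict(orient='records')
    return records[0]

df = pd.DataFrame({'a': [1], 'b': [2.2]})
row = row_as_dict(df, 0)
print(row)
# {'a': 1, 'b': 2.2}
```

If the index contains duplicate labels, this sketch returns only the first matching row.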
Similarly, to add a new row, use the pd.DataFrame.append function:
>>> df = df.append([{'a': 5, 'b': 4.4}]) # NOTE: To append as a row, use []
>>> df
a b
0 1 2.2
0 5 4.4
The above will not cause type conversion:
>>> df.dtypes
a int64
b float64
dtype: object
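Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On newer versions, a minimal sketch of the same idea uses pd.concat with a one-row frame. The variable name new_row is just for illustration, and ignore_index=True is a choice made here to get a fresh 0..n-1 index.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# Build the new row as a one-row frame so each column gets its own dtype,
# then concatenate; ignore_index=True renumbers the result 0..n-1.
new_row = pd.DataFrame([{'a': 5, 'b': 4.4}])
df = pd.concat([df, new_row], ignore_index=True)

print(df.dtypes)
# a      int64
# b    float64
# dtype: object
```

Because both frames hold a as int64 and b as float64, the concatenation has no reason to upcast either column.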
The root of the problem is that indexing a single row of a data frame returns a pandas Series. We can see that:
type(df.loc[0])
# pandas.core.series.Series
And a series can only have one dtype, in your case either int64 or float64.
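The upcast can be reproduced without a data frame at all. The following is a small illustration (the variable names are made up for this sketch) of how a Series built from an int and a float ends up as float64, while an object-dtype Series keeps each value's own type.

```python
import numpy as np
import pandas as pd

# NumPy's common type for int64 and float64 is float64, which is
# why a row holding both gets upcast when it becomes a Series.
common = np.result_type(np.int64, np.float64)
print(common)
# float64

mixed = pd.Series([1, 2.2])
print(mixed.dtype)
# float64

# An object-dtype Series stores Python objects, so each value keeps its own type.
as_objects = pd.Series([1, 2.2], dtype=object)
print(as_objects.dtype)
# object
```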
Two workarounds come to mind:
print(df.loc[[0]])
# this will return a dataframe instead of series
# so the result will be
# a b
# 0 1 2.2
# but the dictionary is hard to read
print(dict(df.loc[[0]]))
# {'a': 0 1
# Name: a, dtype: int64, 'b': 0 2.2
# Name: b, dtype: float64}
or
print(df.astype(object).loc[0])
# this will change the type of value to object first and then print
# so the result will be
# a 1
# b 2.2
# Name: 0, dtype: object
print(dict(df.astype(object).loc[0]))
# in this way the dictionary is as expected
# {'a': 1, 'b': 2.2}
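Since df.astype(object) copies the whole frame, one possible refinement for large data frames is to select the row as a one-row frame first and only then convert it to object. This is a sketch combining the two workarounds above, not something benchmarked here.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# Select the row as a one-row frame first, so astype(object) only
# copies that row instead of the entire data frame.
row = df.loc[[0]].astype(object).iloc[0].to_dict()
print(row)
# {'a': 1, 'b': 2.2}
```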
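As for inserting a row, the pandas source shows that a dict is converted to a Series, which again has a single dtype:
https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6973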
if isinstance(other, dict):
other = Series(other)
So your workaround is actually a solid one. Alternatively, we could do:
df.append(pd.Series({'a': 5, 'b': 4.4}, dtype=object, name=1))
# a b
# 0 1 2.2
# 1 5 4.4