
Prevent coercion of pandas data frames while indexing and inserting rows

I'm working with individual rows of pandas data frames, but I'm stumbling over coercion issues while indexing and inserting rows. Pandas seems to always want to coerce from a mixed int/float to all-float types, and I can't see any obvious controls on this behaviour.

For example, here is a simple data frame with a as int and b as float:

import pandas as pd
pd.__version__  # '0.25.2'

df = pd.DataFrame({'a': [1], 'b': [2.2]})
print(df)
#    a    b
# 0  1  2.2
print(df.dtypes)
# a      int64
# b    float64
# dtype: object

Here is a coercion issue while indexing one row:

print(df.loc[0])
# a    1.0
# b    2.2
# Name: 0, dtype: float64
print(dict(df.loc[0]))
# {'a': 1.0, 'b': 2.2}

And here is a coercion issue while inserting one row:

df.loc[1] = {'a': 5, 'b': 4.4}
print(df)
#      a    b
# 0  1.0  2.2
# 1  5.0  4.4
print(df.dtypes)
# a    float64
# b    float64
# dtype: object

In both instances, I want the a column to remain as an integer type, rather than being coerced to a float type.

Mike T asked Oct 23 '19 23:10



3 Answers

After some digging, here are some terribly ugly workarounds. (A better answer will be accepted.)

A quirk is that a non-numeric column stops the coercion (the row Series becomes object dtype), so here is how to index one row into a dict by temporarily adding a dummy string column:

dict(df.assign(_='').loc[0].drop('_', axis=0))
# {'a': 1, 'b': 2.2}

And inserting a row can be done by creating a new data frame with one row:

df = df.append(pd.DataFrame({'a': 5, 'b': 4.4}, index=[1]))
print(df)
#    a    b
# 0  1  2.2
# 1  5  4.4

Neither of these tricks is optimised for large data frames, so I would greatly appreciate a better answer!
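DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the one-row-frame trick above needs a different call on newer versions. A minimal sketch of the same idea with pd.concat, assuming the new row is built as a frame whose columns match the original (the variable name new_row is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# Build the new row as a one-row frame so each column keeps its own dtype.
new_row = pd.DataFrame({'a': [5], 'b': [4.4]}, index=[1])

# pd.concat stacks the frames row-wise; the two integer columns are
# combined without passing through a mixed-dtype row Series.
df = pd.concat([df, new_row])

print(df)
#    a    b
# 0  1  2.2
# 1  5  4.4
print(df.dtypes)
```

Like the append version, this copies the whole frame on every call, so it is still not a good fit for adding many rows one at a time.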

Mike T answered Oct 19 '22 17:10


Whenever you read data from a dataframe or append data to one and need to keep the data types unchanged, avoid converting to other internal structures that are not aware of the per-column data types.

When you do df.loc[0], the row is converted to a pd.Series:

>>> type(df.loc[0])
<class 'pandas.core.series.Series'>

A Series can only have a single dtype, so the int is coerced to float.
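The choice of float64 can be reproduced outside the indexing call. The sketch below assumes pandas follows NumPy's type-promotion rules for plain numeric dtypes, which is consistent with the output shown in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# NumPy's promotion rule for the two column dtypes: the common type is float64.
common = np.result_type(df['a'].dtype, df['b'].dtype)
print(common)         # float64

# The row Series ends up with that single dtype, so the integer 1 becomes 1.0.
row = df.loc[0]
print(row.dtype)      # float64

# Each column, by contrast, still holds its own dtype.
print(df['a'].dtype)  # int64
```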

Instead, keep the structure as a pd.DataFrame:

>>> type(df.loc[[0]])
<class 'pandas.core.frame.DataFrame'>

Select the row you need as a frame and then convert it to a dict:

>>> df.loc[[0]].to_dict(orient='records')
[{'a': 1, 'b': 2.2}]

Similarly, to add a new row, use the pd.DataFrame.append function:

>>> df = df.append([{'a': 5, 'b': 4.4}])  # NOTE: to append as a row, wrap the dict in []
>>> df
   a    b
0  1  2.2
0  5  4.4

The above does not cause type conversion:

>>> df.dtypes
a      int64
b    float64
dtype: object
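
Since to_dict(orient='records') returns a list, a single row still has to be picked out of it. The sketch below shows that step, plus DataFrame.itertuples as another way to read a row without building a row Series; the variable names are illustrative only:

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# to_dict(orient='records') returns a list, so take the single element.
record = df.loc[[0]].to_dict(orient='records')[0]

# itertuples zips the columns together, so each value comes from its own
# column and no mixed-dtype row Series is built along the way.
first = next(df.itertuples(index=False))
as_dict = first._asdict()

print(record)
print(as_dict)
```

Both results keep the value from column a as an integer rather than 1.0.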

Vishnudev answered Oct 19 '22 17:10


The root of the problem is twofold:

  1. Indexing a single row of a pandas dataframe returns a pandas Series.

We can see that:

type(df.loc[0])
# pandas.core.series.Series

And a series can only have one dtype, in your case either int64 or float64.

Two workarounds come to mind:

print(df.loc[[0]])
# this will return a dataframe instead of series
# so the result will be
#    a    b
# 0  1  2.2

# but the dictionary is hard to read
print(dict(df.loc[[0]]))
# {'a': 0    1
# Name: a, dtype: int64, 'b': 0    2.2
# Name: b, dtype: float64}

or

print(df.astype(object).loc[0])
# this will change the type of value to object first and then print
# so the result will be
# a      1
# b    2.2
# Name: 0, dtype: object

print(dict(df.astype(object).loc[0]))
# in this way the dictionary is as expected
# {'a': 1, 'b': 2.2}

  2. When you append a dictionary to a dataframe, the dictionary is converted to a Series first and then appended, so the same problem happens again:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L6973

if isinstance(other, dict):
    other = Series(other)

So your workaround is actually a solid one; alternatively, we could append a Series with dtype=object:

df.append(pd.Series({'a': 5, 'b': 4.4}, dtype=object, name=1))
#    a    b
# 0  1  2.2
# 1  5  4.4
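
Another option, not taken from the answers here, is to let the coercion happen and cast the columns back afterwards. This is only a sketch: it assumes the values written into the integer columns are whole numbers, since astype truncates anything else, and original_dtypes is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'a': [1], 'b': [2.2]})

# Remember each column's dtype before touching the frame.
original_dtypes = df.dtypes.to_dict()

# Setting with enlargement may upcast column a to float64.
df.loc[1] = [5, 4.4]

# Cast every column back to its remembered dtype. This is only safe when
# the integer columns hold whole numbers (5.0 -> 5); otherwise astype
# silently truncates, and very large integers can lose precision as floats.
df = df.astype(original_dtypes)

print(df.dtypes)
```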

Hongpei answered Oct 19 '22 19:10