
pandas cut(): how to convert nans? Or to convert the output to non-categorical?

I am using pandas.cut() on dataframe columns with nans. I need to run groupby on the output of pandas.cut(), so I need to convert nans to something else (in the output, not in the input data), otherwise groupby will stupidly and infuriatingly ignore them.

I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warnings or errors, but it doesn't work: the categories are not added and, for this very reason, fillna fails. A minimal example is below.

Any ideas?

Or is there maybe an easy way to convert this categorical object to a non-categorical one? I have tried np.asarray() but with no luck: it becomes an array containing Interval objects.

import pandas as pd
import numpy as np

x=[np.nan,4,6]
intervals =[-np.inf,4,np.inf]
out_nolabels=pd.cut(x,intervals)
out_labels=pd.cut(x,intervals, labels=['<=4','>4'])
out_nolabels.add_categories(['missing'])
out_labels.add_categories(['missing'])

print(out_labels)
print(out_nolabels)

out_labels=out_labels.fillna('missing')
out_nolabels=out_nolabels.fillna('missing')
Pythonista anonymous asked Nov 01 '17 11:11





1 Answer

As the documentation says, NA values stay NA in the result and out-of-bounds values become NA in the resulting Categorical object. You can't call fillna with an arbitrary constant on categorical data, because the value you are filling with must already be one of the categories:

Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Categorical object

You can't use x.fillna('missing') because 'missing' is not one of the categories of x, but you can do x.fillna('>4') because '>4' is a category.
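To see the difference on the data from the question, here is a minimal sketch. It wraps the cut() output in a Series and catches both TypeError and ValueError, on the assumption that the exact exception type depends on the pandas version installed.

```python
import numpy as np
import pandas as pd

# Data and bins from the question, wrapped in a Series for convenience
x = pd.Series(pd.cut([np.nan, 4, 6], [-np.inf, 4, np.inf], labels=['<=4', '>4']))

# Filling with a label that is already a category works
filled_existing = x.fillna('>4')

# Filling with a label that is not a category is rejected.
# The exception type has differed between pandas versions, so catch both.
try:
    x.fillna('missing')
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(list(filled_existing))
print(rejected)
```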

We can use np.where to work around that (df is the example DataFrame defined further down, and intervals is the list from the question):

x = pd.cut(df['id'],intervals, labels=['<=4','>4'])

np.where(x.isnull(),'missing',x)
array(['<=4', '<=4', '<=4', '<=4', 'missing', 'missing'], dtype=object)
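The same idea applied directly to the list from the question, as a self-contained sketch. The result is a plain NumPy object array, so this also covers the "convert to non-categorical" part of the question. The variable name as_strings is only illustrative.

```python
import numpy as np
import pandas as pd

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]
out_labels = pd.cut(x, intervals, labels=['<=4', '>4'])

# np.where builds a new object array: 'missing' where the bin is NA,
# the original label everywhere else. The categorical dtype is dropped.
as_strings = np.where(pd.isna(out_labels), 'missing', out_labels)

print(as_strings)
```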

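Or add the new category with add_categories on the values and then fill. Note that add_categories returns a new object instead of modifying the existing one, so its result must be assigned, i.e.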

x = pd.cut(df['id'],intervals, labels=['<=4','>4']).values.add_categories('missing')
x.fillna('missing')

[<=4, <=4, <=4, <=4, missing, missing]
Categories (3, object): [<=4 < >4 < missing]
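Applied to the labelled example from the question, the fix would look roughly like the sketch below. The only change from the question's code is that the result of add_categories is assigned back before calling fillna.

```python
import numpy as np
import pandas as pd

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]
out_labels = pd.cut(x, intervals, labels=['<=4', '>4'])

# add_categories returns a new Categorical, so keep the result
out_labels = out_labels.add_categories(['missing'])

# 'missing' is now a known category, so fillna accepts it
out_labels = out_labels.fillna('missing')

print(out_labels)
```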

If you want to group the NaNs without changing the dtype of the column itself, one way of doing it is to cast only the groupby key to str. For example, if you have a dataframe

df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})

df.groupby(df.id.astype(str)).mean()

Output:

     id  value
id             
1.0  1.0    5.0
4.0  4.0    7.0
nan  NaN    4.5
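Putting the pieces together for the original goal (groupby on the cut() output), here is a hedged sketch on the same dataframe. It assumes a reasonably recent pandas: observed=True is passed so that the unused '>4' bin is left out, and the dropna=False variant at the end assumes pandas 1.1 or later. The names binned, means and means_by_id are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan],
                   'value': [4, 5, 6, 7, 8, 1]})
intervals = [-np.inf, 4, np.inf]

# Bin the column, register the extra label, then fill the NA bins with it
binned = pd.cut(df['id'], intervals, labels=['<=4', '>4'])
binned = binned.cat.add_categories('missing').fillna('missing')

# Group by the filled bins; observed=True keeps only bins that occur
means = df.groupby(binned, observed=True)['value'].mean()
print(means)

# Alternative, assuming pandas >= 1.1: keep NaN keys as their own group
means_by_id = df.groupby('id', dropna=False)['value'].mean()
print(means_by_id)
```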
Bharath answered Sep 20 '22 23:09
