pandas cut(): how to convert nans? Or to convert the output to non-categorical?

Tags:

python

pandas

categorical-data

I am using pandas.cut() on dataframe columns that contain NaNs. I need to run groupby on the output of pandas.cut(), so I need to convert the NaNs to something else (in the output, not in the input data); otherwise groupby will stupidly and infuriatingly ignore them.

I understand that cut() now outputs categorical data, but I cannot find a way to add a category to the output. I have tried add_categories(), which runs with no warnings or errors but doesn't work: the categories are not added and, indeed, fillna fails for this very reason. A minimal example is below.

Any ideas?

Or is there maybe an easy way to convert this categorical object to a non-categorical one? I have tried np.asarray(), but with no luck: it becomes an array containing Interval objects.

import pandas as pd
import numpy as np

x=[np.nan,4,6]
intervals =[-np.inf,4,np.inf]
out_nolabels=pd.cut(x,intervals)
out_labels=pd.cut(x,intervals, labels=['<=4','>4'])
out_nolabels.add_categories(['missing'])
out_labels.add_categories(['missing'])

print(out_labels)
print(out_nolabels)

out_labels=out_labels.fillna('missing')
out_nolabels=out_nolabels.fillna('missing')


asked Nov 01 '17 11:11

Pythonista anonymous

1 Answer

As the documentation says, NA and out-of-bounds values become NA in the resulting Categorical object. You can't fill those NAs with an arbitrary constant, because the value you fill with must already be one of the categories.

Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Categorical object

You can't use x.fillna('missing') because 'missing' is not one of the categories of x, but you can do x.fillna('>4') because '>4' is a category.
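A minimal sketch of that difference, reusing the question's data. The variable names `binned`, `filled_existing` and `rejected` are only illustrative, and because the exception type raised for an unknown label has differed between pandas versions, the sketch catches both TypeError and ValueError rather than assuming one.

```python
import numpy as np
import pandas as pd

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]

# Wrap the Categorical returned by cut() in a Series for convenience.
binned = pd.Series(pd.cut(x, intervals, labels=['<=4', '>4']))

# Filling with a label that is already a category is accepted.
filled_existing = binned.fillna('>4')

# Filling with a label that is not a category is rejected.
try:
    binned.fillna('missing')
    rejected = False
except (TypeError, ValueError) as exc:
    rejected = True
    print(type(exc).__name__, exc)
```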

We can use np.where to get around that:

# df is the dataframe defined further down; intervals comes from the question
x = pd.cut(df['id'], intervals, labels=['<=4', '>4'])

np.where(x.isnull(), 'missing', x)
array(['<=4', '<=4', '<=4', '<=4', 'missing', 'missing'], dtype=object)
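The question also asks how to get a non-categorical result. The np.where call above already returns a plain object array. If you would rather stay inside pandas, one possible sketch is to cast the categorical Series to object dtype first, after which fillna accepts any value. This uses the question's unlabeled example; the name `plain` is only illustrative.

```python
import numpy as np
import pandas as pd

x = [np.nan, 4, 6]
intervals = [-np.inf, 4, np.inf]

# Without labels, the categories are Interval objects.
out_nolabels = pd.Series(pd.cut(x, intervals))

# Casting to object drops the categorical dtype, so any fill value is allowed.
plain = out_nolabels.astype(object).fillna('missing')
print(plain)
```

The non-missing entries remain Interval objects; cast them with str() if you need text labels.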

Or use add_categories on the values first, i.e.

x = pd.cut(df['id'],intervals, labels=['<=4','>4']).values.add_categories('missing')
x.fillna('missing')

[<=4, <=4, <=4, <=4, missing, missing]
Categories (3, object): [<=4 < >4 < missing]
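The reason add_categories appeared to do nothing in the question is that it returns a new object rather than modifying the existing one, so its result has to be assigned. A sketch of the whole path from cut to groupby, using the .cat accessor on a Series instead of .values (an alternative spelling, not what the answer above uses) and the example dataframe from this answer:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan],
                   'value': [4, 5, 6, 7, 8, 1]})
intervals = [-np.inf, 4, np.inf]

binned = pd.cut(df['id'], intervals, labels=['<=4', '>4'])

# add_categories returns a new Series, so assign the result back.
binned = binned.cat.add_categories('missing').fillna('missing')

# observed=True leaves out the empty '>4' bin.
means = df.groupby(binned, observed=True)['value'].mean()
print(means)
```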

If you want to group the NaNs without changing the column's dtype, one way is to cast only the grouping key to str. For example, if you have a dataframe

df = pd.DataFrame({'id':[1,1,1,4,np.nan,np.nan],'value':[4,5,6,7,8,1]})

df.groupby(df.id.astype(str)).mean()

Output:

     id  value
id             
1.0  1.0    5.0
4.0  4.0    7.0
nan  NaN    4.5
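As an alternative to the str cast above, and assuming pandas 1.1 or later is available, groupby accepts a dropna argument that keeps NaN as its own group. A minimal sketch on the same dataframe, grouping on the raw numeric column:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1, 4, np.nan, np.nan],
                   'value': [4, 5, 6, 7, 8, 1]})

# dropna=False keeps rows whose key is NaN in a separate group
# (assumes pandas >= 1.1, where this argument was introduced).
means = df.groupby('id', dropna=False)['value'].mean()
print(means)
```

This keeps the key numeric, so the group labels are floats and NaN rather than the strings produced by the cast.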


answered Sep 20 '22 23:09

Bharath
