For a dataframe df
import pandas as pd

df = pd.DataFrame({'id': ['1', '1', None, None, '1', '2', '2', '3', None, '4'],
                   'last_name': ['Clinton', 'Clinton', 'Clinton', 'Clinton', None, 'Bush', 'Bush', None, 'Obama', 'Obama'],
                   'first_name': ['Bill', 'William', 'Bill', 'William', None, 'Georg W.', 'Georg', None, 'Barack', 'Barack']})
df['id'] = df['id'].astype('category')
print(df)
which gives the following table
    id last_name first_name
0    1   Clinton       Bill
1    1   Clinton    William
2  NaN   Clinton       Bill
3  NaN   Clinton    William
4    1       NaN       None
5    2      Bush   Georg W.
6    2      Bush      Georg
7    3       NaN       None
8  NaN     Obama     Barack
9    4     Obama     Barack
I want to group by the id and last_name, drop duplicates, and remove None iff there is more than one entry. So the output should be like
              first_name
id last_name
1  Clinton          Bill
   Clinton       William
2  Bush         Georg W.
   Bush            Georg
3  None             None
4  Obama          Barack
One of my problems is that groupby does not work here, because it excludes the None/NaN values.
Any elegant ideas?
The abstract definition of grouping is to provide a mapping of labels to group names. Pandas groupby is used to split data into groups according to some criteria and apply a function to each group, which also helps aggregate data efficiently; pandas objects can be split on any of their axes. Note: here we refer to the grouping objects as the keys. To group data with one key, we pass a single key as an argument to the groupby function.
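For example, a minimal sketch of grouping by a single key (the region/amount columns below are made up for illustration and are not part of the question's data):

import pandas as pd

sales = pd.DataFrame({'region': ['east', 'west', 'east', 'west'],
                      'amount': [10, 20, 30, 40]})

# One key: groupby splits the frame into one group per region,
# then sum() aggregates the amounts within each group.
print(sales.groupby('region')['amount'].sum())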
If you call dir() on a Pandas GroupBy object, then you'll see enough methods there to make your head spin! It can be hard to keep track of all of the functionality of a Pandas GroupBy object. One way to clear the fog is to compartmentalize the different methods into what they do and how they behave.
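For instance, using the df from the question (a small sketch; the name filtering is only there to keep the output readable):

grouped = df.groupby('last_name')
# dir() lists everything the GroupBy object exposes; even the public
# names alone make a long list.
print([name for name in dir(grouped) if not name.startswith('_')])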
To get some background information, check out How to Speed Up Your Pandas Projects. What may happen with .apply() is that it effectively performs a Python loop over each group. While the .groupby(...).apply() pattern can provide some flexibility, it can also inhibit Pandas from otherwise using its Cython-based optimizations.
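As a quick illustration (a sketch using the question's df; both lines compute the same per-group row count):

# .apply() calls a Python-level function once per group ...
print(df.groupby('last_name').apply(lambda g: len(g)))

# ... whereas the built-in .size() aggregation stays on pandas' optimized path.
print(df.groupby('last_name').size())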
IIUC, assuming your data frame has a structure similar to the one you posted, you can use ffill() on id and last_name, group by those forward-filled keys, and then dropna only if the len of each group is greater than 1.
df.groupby([df.id.ffill(), df.last_name.ffill()]).apply(lambda s: s.dropna() if len(s) > 1 else s).reset_index(drop=True)
    id last_name first_name  id2
0    1   Clinton       Bill    1
1    1   Clinton    William    1
2    2      Bush   Georg W.    2
3    2      Bush      Georg    2
4    3      None       None    3
5  NaN     Obama     Barack    3
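The same idea spelled out step by step (a sketch equivalent to the one-liner above; the helper name dropna_if_multiple is just for illustration):

# Forward-fill the grouping keys so rows with a missing id or last_name
# fall into the group of the preceding row instead of being silently
# dropped by groupby.
id_key = df['id'].ffill()
name_key = df['last_name'].ffill()

def dropna_if_multiple(group):
    # Remove rows containing NaN/None only when the group has more than
    # one row; a lone row is kept even if it holds missing values.
    return group.dropna() if len(group) > 1 else group

result = (df.groupby([id_key, name_key])
            .apply(dropna_if_multiple)
            .reset_index(drop=True))
print(result)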