
Pandas groupby with None

For a dataframe df

import pandas as pd

df = pd.DataFrame({'id': ['1', '1', None, None, '1', '2', '2', '3', None, '4'],
                   'last_name': ['Clinton', 'Clinton', 'Clinton', 'Clinton', None, 'Bush', 'Bush', None, 'Obama', 'Obama'],
                   'first_name': ['Bill', 'William', 'Bill', 'William', None, 'Georg W.', 'Georg', None, 'Barack', 'Barack']})

# Store id as a categorical column
df['id'] = df['id'].astype('category')
print(df)

which gives the following table

    id last_name first_name
0    1   Clinton       Bill
1    1   Clinton    William
2  NaN   Clinton       Bill
3  NaN   Clinton    William
4    1       NaN       None
5    2      Bush   Georg W.
6    2      Bush      Georg
7    3       NaN       None
8  NaN     Obama     Barack
9    4     Obama     Barack

I want to group by id and last_name, drop duplicates, and remove None entries only if the group has more than one entry. The output should look like this:

              first_name
id  last_name           
1   Clinton       Bill
    Clinton       William
2   Bush          Georg W.
    Bush          Georg
3   None          None
4   Obama         Barack

One of my problems is that groupby does not work for this, because it excludes rows whose key is None / NaN.
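
The exclusion is easy to reproduce. The following is a minimal sketch on a made-up four-row frame (not the data above). It assumes pandas 1.1 or later, where groupby accepts dropna=False to keep missing keys as their own group; that option may or may not fit the grouping wanted here.

```python
import pandas as pd

# Hypothetical toy frame: one row has a missing id
toy = pd.DataFrame({
    "id": ["1", "1", None, "2"],
    "first_name": ["Bill", "William", "Bill", "Georg"],
})

# Default behaviour: the row with the missing key is silently left out
default_sizes = toy.groupby("id").size()

# pandas >= 1.1: keep the missing key as a group of its own
with_missing = toy.groupby("id", dropna=False).size()

print(default_sizes)
print(with_missing)
```

With the default, the group sizes add up to fewer rows than the frame contains; with dropna=False every row is accounted for.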

Any elegant ideas?

Michael Dorner asked Sep 24 '18 14:09

1 Answer

If I understand correctly, and assuming your data frame has a structure similar to the one you posted, you can forward-fill the keys with ffill(), group by the filled keys, and then call dropna only on groups that contain more than one row.

(df.groupby([df.id.ffill(), df.last_name.ffill()])
   .apply(lambda s: s.dropna() if len(s) > 1 else s)
   .reset_index(drop=True))

    id  last_name   first_name  id2
0   1   Clinton     Bill        1
1   1   Clinton     William     1
2   2   Bush        Georg W.    2
3   2   Bush        Georg       2
4   3   None        None        3
5   NaN Obama       Barack      3
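
The same idea can be written without apply. This is only a sketch under some assumptions: id is kept as a plain object column rather than a categorical, the names keys, group_size and complete are made up for illustration, and a boolean mask stands in for the lambda (keep a row if it has no missing values, or if its forward-filled group has a single row).

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["1", "1", None, None, "1", "2", "2", "3", None, "4"],
    "last_name": ["Clinton", "Clinton", "Clinton", "Clinton", None,
                  "Bush", "Bush", None, "Obama", "Obama"],
    "first_name": ["Bill", "William", "Bill", "William", None,
                   "Georg W.", "Georg", None, "Barack", "Barack"],
})

# Forward-fill the keys so a row with a missing key joins the group above it
keys = [df["id"].ffill(), df["last_name"].ffill()]

# Size of the forward-filled group each row belongs to
group_size = df.groupby(keys)["id"].transform("size")

# True for rows without any missing value
complete = df.notna().all(axis=1)

# Keep complete rows, plus every row that is alone in its group
result = (df[complete | (group_size == 1)]
          .drop_duplicates()
          .reset_index(drop=True))
print(result)
```

On the question's sample data the row with a missing id and last_name Obama survives, because after forward-filling it sits alone in its group. If rows with a missing id should never appear in the result, an extra filter on id would be needed.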
rafaelc answered Oct 03 '22 07:10