For a dataframe df
import pandas as pd

df = pd.DataFrame({'id': ['1', '1', None, None, '1', '2', '2', '3', None, '4'],
                   'last_name': ['Clinton', 'Clinton', 'Clinton', 'Clinton', None, 'Bush', 'Bush', None, 'Obama', 'Obama'],
                   'first_name': ['Bill', 'William', 'Bill', 'William', None, 'Georg W.', 'Georg', None, 'Barack', 'Barack']})
df['id'] = df['id'].astype('category')
print(df)
which gives the following table
    id last_name first_name
0    1   Clinton       Bill
1    1   Clinton    William
2  NaN   Clinton       Bill
3  NaN   Clinton    William
4    1       NaN       None
5    2      Bush   Georg W.
6    2      Bush      Georg
7    3       NaN       None
8  NaN     Obama     Barack
9    4     Obama     Barack
I want to group by the id and last_name, drop duplicates, and remove None iff there is more than one entry. So the output should be like
              first_name
id last_name
1  Clinton          Bill
   Clinton       William
2  Bush         Georg W.
   Bush            Georg
3  None             None
4  Obama          Barack
One of my problems is that groupby does not work here, because it excludes the None/NaN values.
Any elegant ideas?
The abstract definition of grouping is to provide a mapping of labels to group names. Pandas groupby is used to split data into groups according to some criteria and apply a function to each group, which also helps aggregate data efficiently; pandas objects can be split on any of their axes. Note: here we refer to the grouping objects as the keys. To group data with one key, we pass a single key as an argument to the groupby function.
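For example, a minimal sketch of grouping by a single key (the region/amount columns below are made up for illustration and are not part of the question's data):

import pandas as pd

sales = pd.DataFrame({'region': ['east', 'west', 'east', 'west'],
                      'amount': [10, 20, 30, 40]})

# One key: groupby splits the frame into one group per region,
# then sum() aggregates the amounts within each group.
print(sales.groupby('region')['amount'].sum())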
If you call dir() on a Pandas GroupBy object, then you'll see enough methods there to make your head spin! It can be hard to keep track of all of the functionality of a Pandas GroupBy object. One way to clear the fog is to compartmentalize the different methods into what they do and how they behave.
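For instance, using the df from the question (a small sketch; the name filtering is only there to keep the output readable):

grouped = df.groupby('last_name')
# dir() lists everything the GroupBy object exposes; even the public
# names alone make a long list.
print([name for name in dir(grouped) if not name.startswith('_')])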
To get some background information, check out How to Speed Up Your Pandas Projects. What may happen with .apply() is that it effectively performs a Python loop over each group. While the .groupby(...).apply() pattern can provide some flexibility, it can also inhibit Pandas from otherwise using its Cython-based optimizations.
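As a quick illustration (a sketch using the question's df; both lines compute the same per-group row count):

# .apply() calls a Python-level function once per group ...
print(df.groupby('last_name').apply(lambda g: len(g)))

# ... whereas the built-in .size() aggregation stays on pandas' optimized path.
print(df.groupby('last_name').size())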
IIUC, assuming your data frame has a structure similar to the one you posted, you can use ffill() on id and last_name, group by those forward-filled keys, and then dropna only if the len of each group is greater than 1.
df.groupby([df.id.ffill(), df.last_name.ffill()]).apply(lambda s: s.dropna() if len(s) > 1 else s).reset_index(drop=True)
    id last_name first_name  id2
0    1   Clinton       Bill    1
1    1   Clinton    William    1
2    2      Bush   Georg W.    2
3    2      Bush      Georg    2
4    3      None       None    3
5  NaN     Obama     Barack    3
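The same idea spelled out step by step (a sketch equivalent to the one-liner above; the helper name dropna_if_multiple is just for illustration):

# Forward-fill the grouping keys so rows with a missing id or last_name
# fall into the group of the preceding row instead of being silently
# dropped by groupby.
id_key = df['id'].ffill()
name_key = df['last_name'].ffill()

def dropna_if_multiple(group):
    # Remove rows containing NaN/None only when the group has more than
    # one row; a lone row is kept even if it holds missing values.
    return group.dropna() if len(group) > 1 else group

result = (df.groupby([id_key, name_key])
            .apply(dropna_if_multiple)
            .reset_index(drop=True))
print(result)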