
Dropping duplicate observations with more missing values

Tags: python, pandas

I have duplicated rows in my df, but some of the duplicates have a lot of NaNs. For each set of duplicates I would like to keep the row with the fewest missing values.

Any ideas how to do this?

This is an example of my df:

id    B    C    D
1     2    3    4
1     .    3    4
1     .    .    4
2     9    7    .
2     9    .    8
2     9    7    8
2     .    .    .

In this example I would like to keep only the first observation and the 6th.

Thanks

asked Apr 22 '26 by Pierrot75

2 Answers

You could use df.isna().sum(axis=1) to count the number of NaNs in each row, then group by id and select the row with the fewest NaNs using idxmin:

df.loc[df.isna().sum(axis=1).groupby(df.id).idxmin(),:]

   id    B    C    D
0   1  2.0  3.0  4.0
5   2  9.0  7.0  8.0

Make sure the missing values are actual NaNs as you specified; otherwise convert them first (note that replace returns a copy, so assign the result back):

df = df.replace('.', np.nan)
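Put together as a runnable sketch on the question's data (the values below are copied from the question's table, with '.' as the missing marker):

```python
import numpy as np
import pandas as pd

# Reconstruct the question's DataFrame, with '.' standing in for missing values
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "B":  [2, ".", ".", 9, 9, 9, "."],
    "C":  [3, 3, ".", 7, ".", 7, "."],
    "D":  [4, 4, 4, ".", 8, 8, "."],
})

# Convert the '.' placeholders to real NaNs (replace returns a copy)
df = df.replace(".", np.nan)

# Count NaNs per row, take the index of the per-id minimum, and select those rows
result = df.loc[df.isna().sum(axis=1).groupby(df["id"]).idxmin()]
print(result)
```

This keeps the first row (id 1, no NaNs) and the sixth row (id 2, no NaNs), as requested.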
answered Apr 23 '26 by yatu

A different approach that does more than what you asked for: if some values are missing in one row and different ones are missing in another, this combines them to get more complete information:

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2],
                   "B": [2, np.nan, np.nan, np.nan, np.nan, 9],
                   "C": [3, 3, np.nan, 7, np.nan, np.nan],
                   "D": [4, 4, 4, np.nan, 8, np.nan]})
#    id    B    C    D
# 0   1  2.0  3.0  4.0
# 1   1  NaN  3.0  4.0
# 2   1  NaN  NaN  4.0
# 3   2  NaN  7.0  NaN
# 4   2  NaN  NaN  8.0
# 5   2  9.0  NaN  NaN

df.groupby("id", as_index=False).fillna(method="bfill").drop_duplicates(subset="id")
#    id    B    C    D
# 0   1  2.0  3.0  4.0
# 3   2  9.0  7.0  8.0

Note that the example df is slightly different from the one in your question, to show where this approach is better.

For id 1 it is the same as just taking the first row. But for id 2 it is able to fill in every value, whereas your approach (or the other answer's) would keep just one row, none of which is complete.

Obviously this assumes that the values that are not NaN stay the same across rows. If they do not, each NaN is simply filled with the next non-NaN value below it in that column, and any conflicting values in later rows are ignored.
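To make this caveat concrete, here is a small sketch with made-up data where the same id has conflicting non-NaN values in column C (3.0 vs 99.0). bfill only fills NaNs and never overwrites an existing value, so the conflicting 99.0 is silently dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical data: both rows belong to id 1 but disagree on C (3.0 vs 99.0)
df = pd.DataFrame({"id": [1, 1],
                   "B": [np.nan, 5.0],
                   "C": [3.0, 99.0]})

# Backfill within the group, reattach the grouping column
# (groupby().bfill() leaves it out of its result), keep the first row per id
filled = df.groupby("id").bfill()
filled.insert(0, "id", df["id"])
combined = filled.drop_duplicates(subset="id")

# The kept row has B=5.0 (backfilled) and C=3.0 (its own value);
# the conflicting 99.0 never survives.
print(combined)
```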


Edit:

In newer pandas versions (at least 1.4.2), fillna seems to do weird things when applied to the grouped dataframe and drops the id column. You can circumvent this by using apply:

df.groupby("id", as_index=False)\
  .apply(lambda s: s.fillna(method="bfill"))\
  .drop_duplicates(subset="id")
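As a side note beyond the original answer: fillna(method=...) is deprecated in recent pandas, so a version of the same idea using the dedicated bfill method sidesteps both problems. A sketch on the answer's example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2],
                   "B": [2, np.nan, np.nan, np.nan, np.nan, 9],
                   "C": [3, 3, np.nan, 7, np.nan, np.nan],
                   "D": [4, 4, 4, np.nan, 8, np.nan]})

# Backfill within each id group, then reattach the grouping column,
# which groupby().bfill() leaves out of its result
filled = df.groupby("id").bfill()
filled.insert(0, "id", df["id"])
result = filled.drop_duplicates(subset="id")
print(result)
```

This keeps the first row per id after backfilling, giving (2, 3, 4) for id 1 and the fully combined (9, 7, 8) for id 2.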
answered Apr 23 '26 by Graipher


