
Dropping duplicate observations with more missing values

Tags: python, pandas

I have duplicated rows in my df, but some of the duplicates have a lot of NaNs. For each set of duplicates I would like to keep the row with the fewest missing values.

Any ideas how to do this?

This is an example of my df:

id    B    C    D
1     2    3    4
1     .    3    4
1     .    .    4
2     9    7    .
2     9    .    8
2     9    7    8
2     .    .    .

In this example I would like to keep only the first observation and the 6th.

Thanks

asked Apr 22 '26 by Pierrot75

2 Answers

You could use df.isna().sum(axis=1) to count the number of NaNs in each row, then group by id and select the row with the fewest NaNs using idxmin:

df.loc[df.isna().sum(axis=1).groupby(df.id).idxmin(),:]

   id    B    C    D
0   1  2.0  3.0  4.0
5   2  9.0  7.0  8.0

Make sure the missing values are actual NaNs as you specified; otherwise convert them first (note that replace returns a copy, so assign the result back):

df = df.replace('.', np.nan)
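Put together as a runnable sketch on the question's data (the values below are copied from the question's table, with '.' as the missing marker):

```python
import numpy as np
import pandas as pd

# Reconstruct the question's DataFrame, with '.' standing in for missing values
df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "B":  [2, ".", ".", 9, 9, 9, "."],
    "C":  [3, 3, ".", 7, ".", 7, "."],
    "D":  [4, 4, 4, ".", 8, 8, "."],
})

# Convert the '.' placeholders to real NaNs (replace returns a copy)
df = df.replace(".", np.nan)

# Count NaNs per row, take the index of the per-id minimum, and select those rows
result = df.loc[df.isna().sum(axis=1).groupby(df["id"]).idxmin()]
print(result)
```

This keeps the first row (id 1, no NaNs) and the sixth row (id 2, no NaNs), as requested.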
answered Apr 23 '26 by yatu

A different approach that does more than what you asked for: if some values are missing in one row and different ones are missing in another, this combines them to get more complete information:

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2],
                   "B": [2, np.nan, np.nan, np.nan, np.nan, 9],
                   "C": [3, 3, np.nan, 7, np.nan, np.nan],
                   "D": [4, 4, 4, np.nan, 8, np.nan]})
#    id    B    C    D
# 0   1  2.0  3.0  4.0
# 1   1  NaN  3.0  4.0
# 2   1  NaN  NaN  4.0
# 3   2  NaN  7.0  NaN
# 4   2  NaN  NaN  8.0
# 5   2  9.0  NaN  NaN

df.groupby("id", as_index=False).fillna(method="bfill").drop_duplicates(subset="id")
#    id    B    C    D
# 0   1  2.0  3.0  4.0
# 3   2  9.0  7.0  8.0

Note that the example df is slightly different from the one in your question, to show where this approach is better.

For id 1 it is the same as just taking the first row. But for id 2 it is able to fill in every value, whereas your approach (or the other answer's) would keep just one row, none of which is complete.

Obviously this assumes that the values that are not NaN stay the same across rows. If they do not, each NaN is simply filled with the next non-NaN value below it in that column, and any conflicting values in later rows are ignored.
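To make this caveat concrete, here is a small sketch with made-up data where the same id has conflicting non-NaN values in column C (3.0 vs 99.0). bfill only fills NaNs and never overwrites an existing value, so the conflicting 99.0 is silently dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical data: both rows belong to id 1 but disagree on C (3.0 vs 99.0)
df = pd.DataFrame({"id": [1, 1],
                   "B": [np.nan, 5.0],
                   "C": [3.0, 99.0]})

# Backfill within the group, reattach the grouping column
# (groupby().bfill() leaves it out of its result), keep the first row per id
filled = df.groupby("id").bfill()
filled.insert(0, "id", df["id"])
combined = filled.drop_duplicates(subset="id")

# The kept row has B=5.0 (backfilled) and C=3.0 (its own value);
# the conflicting 99.0 never survives.
print(combined)
```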


Edit:

In newer pandas versions (at least 1.4.2), fillna seems to do weird things when applied to the grouped dataframe and drops the id column. You can circumvent this by using apply:

df.groupby("id", as_index=False)\
  .apply(lambda s: s.fillna(method="bfill"))\
  .drop_duplicates(subset="id")
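As a side note beyond the original answer: fillna(method=...) is deprecated in recent pandas, so a version of the same idea using the dedicated bfill method sidesteps both problems. A sketch on the answer's example data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 2],
                   "B": [2, np.nan, np.nan, np.nan, np.nan, 9],
                   "C": [3, 3, np.nan, 7, np.nan, np.nan],
                   "D": [4, 4, 4, np.nan, 8, np.nan]})

# Backfill within each id group, then reattach the grouping column,
# which groupby().bfill() leaves out of its result
filled = df.groupby("id").bfill()
filled.insert(0, "id", df["id"])
result = filled.drop_duplicates(subset="id")
print(result)
```

This keeps the first row per id after backfilling, giving (2, 3, 4) for id 1 and the fully combined (9, 7, 8) for id 2.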
answered Apr 23 '26 by Graipher


