Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Drop duplicates from level in hierarchical index pandas

Tags:

python

pandas

I want to de-duplicate the following hierarchical indexed dataframe based off the second index. I haven't been able to find a way of doing this. there is a pandas.Multiindex.drop_duplicates() but it doesn't allow you to specify level.

An example dataframe is:

In [5]: df
Out[5]:
               given_name  surname  dob  phone_number_1_clean 
985    2414           1.0      1.0  0.0                   1.0
       122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  0.0                   0.0
       418911         1.0      1.0  0.0                   0.0
       516362         1.0      1.0  0.0                   0.0
2414   122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  1.0                   0.0
       418911         1.0      1.0  1.0                   0.0
       516362         1.0      1.0  0.0                   0.0
122864 167863         1.0      1.0  0.0                   1.0
       418911         1.0      1.0  0.0                   1.0
       516362         1.0      1.0  0.0                   1.0
167863 418911         1.0      1.0  1.0                   1.0
       516362         1.0      1.0  0.0                   1.0
418911 516362         1.0      1.0  0.0                   1.0

The output should look:

               given_name  surname  dob  phone_number_1_clean 
985    2414           1.0      1.0  0.0                   1.0
       122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  0.0                   0.0
       418911         1.0      1.0  0.0                   0.0
       516362         1.0      1.0  0.0                   0.0
like image 499
Auren Ferguson Avatar asked Jan 03 '23 03:01

Auren Ferguson


1 Answers

Use get_level_values for select second level of MultiIndex with duplicated for boolean mask, invert condition and filter by boolean indexing:

df = df[~df.index.get_level_values(1).duplicated()]
print (df)
            given_name  surname  dob  phone_number_1_clean
985 2414           1.0      1.0  0.0                   1.0
    122864         1.0      1.0  0.0                   0.0
    167863         1.0      1.0  0.0                   0.0
    418911         1.0      1.0  0.0                   0.0
    516362         1.0      1.0  0.0                   0.0

Detail:

print (df.index.get_level_values(1))
Int64Index([  2414, 122864, 167863, 418911, 516362, 122864, 167863, 418911,
            516362, 167863, 418911, 516362, 418911, 516362, 516362],
           dtype='int64')

print (df.index.get_level_values(1).duplicated())
[False False False False False  True  True  True  True  True  True  True
  True  True  True]

print (~df.index.get_level_values(1).duplicated())
[ True  True  True  True  True False False False False False False False
 False False False]
like image 110
jezrael Avatar answered Jan 05 '23 17:01

jezrael