I want to de-duplicate the following hierarchically indexed DataFrame based on the second index level. I haven't been able to find a way of doing this. There is pandas.MultiIndex.drop_duplicates(),
but it doesn't allow you to specify a level.
An example dataframe is:
In [5]: df
Out[5]:
               given_name  surname  dob  phone_number_1_clean
985    2414           1.0      1.0  0.0                   1.0
       122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  0.0                   0.0
       418911         1.0      1.0  0.0                   0.0
       516362         1.0      1.0  0.0                   0.0
2414   122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  1.0                   0.0
       418911         1.0      1.0  1.0                   0.0
       516362         1.0      1.0  0.0                   0.0
122864 167863         1.0      1.0  0.0                   1.0
       418911         1.0      1.0  0.0                   1.0
       516362         1.0      1.0  0.0                   1.0
167863 418911         1.0      1.0  1.0                   1.0
       516362         1.0      1.0  0.0                   1.0
418911 516362         1.0      1.0  0.0                   1.0
The output should look like this:
               given_name  surname  dob  phone_number_1_clean
985    2414           1.0      1.0  0.0                   1.0
       122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  0.0                   0.0
       418911         1.0      1.0  0.0                   0.0
       516362         1.0      1.0  0.0                   0.0
Use get_level_values to select the second level of the MultiIndex and duplicated to build a boolean mask, then invert the mask and filter by boolean indexing:
df = df[~df.index.get_level_values(1).duplicated()]
print (df)
               given_name  surname  dob  phone_number_1_clean
985    2414           1.0      1.0  0.0                   1.0
       122864         1.0      1.0  0.0                   0.0
       167863         1.0      1.0  0.0                   0.0
       418911         1.0      1.0  0.0                   0.0
       516362         1.0      1.0  0.0                   0.0
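For anyone who wants to try this without the original data, here is a minimal self-contained sketch of the same approach. The frame below is a hypothetical cut-down version of the one in the question (fewer rows and columns, illustrative values), and the names `mask` and `deduped` are only chosen for this example.

```python
import pandas as pd

# Hypothetical small frame; not the asker's full data.
index = pd.MultiIndex.from_tuples(
    [
        (985, 2414),
        (985, 122864),
        (985, 167863),
        (2414, 122864),
        (2414, 167863),
        (122864, 167863),
    ]
)
df = pd.DataFrame(
    {
        "given_name": [1.0] * 6,
        "dob": [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
    },
    index=index,
)

# True for every row whose second-level value has already appeared earlier.
mask = df.index.get_level_values(1).duplicated()

# Keep only the first occurrence of each second-level value.
deduped = df[~mask]
print(deduped)
```

Assigning the result to a new name instead of overwriting `df` keeps the original frame available for inspection afterwards.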
Detail (run on the original df, before filtering):
print (df.index.get_level_values(1))
Int64Index([ 2414, 122864, 167863, 418911, 516362, 122864, 167863, 418911,
516362, 167863, 418911, 516362, 418911, 516362, 516362],
dtype='int64')
print (df.index.get_level_values(1).duplicated())
[False False False False False True True True True True True True
True True True]
print (~df.index.get_level_values(1).duplicated())
[ True True True True True False False False False False False False
False False False]
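The mask above keeps the first occurrence of each value because `duplicated` defaults to `keep='first'`. As a small sketch of the alternatives, assuming a hypothetical index of second-level values (not the asker's data), the `keep` parameter can also mark everything but the last occurrence, or every value that occurs more than once:

```python
import pandas as pd

# Hypothetical second-level values, shortened for illustration.
second_level = pd.Index([2414, 122864, 167863, 122864, 167863, 167863])

# Default: every occurrence after the first is flagged.
first_mask = second_level.duplicated()

# keep="last": every occurrence before the last is flagged.
last_mask = second_level.duplicated(keep="last")

# keep=False: all occurrences of a repeated value are flagged.
all_mask = second_level.duplicated(keep=False)

print(first_mask)
print(last_mask)
print(all_mask)
```

Inverting any of these masks with `~` and using it for boolean indexing works the same way as in the answer.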