I have the following df:
import pandas as pd
import numpy as np

df = pd.DataFrame(
    [
        [["John Muller"], "person", [8866155845]],
        [["Innovation Division"], "company", np.nan],
        [["Carol Sway"], "person", [8866155845]],
    ],
    columns=["name", "kind", "phone"],
)
# Out:
# name kind phone
# 0 [John Muller] person [8866155845]
# 1 [Innovation Division] company NaN
# 2 [Carol Sway] person [8866155845]
and I want to find duplicate phone numbers. But the objects in df are lists, so calling:
df.duplicated('phone')
raises:
TypeError: unhashable type: 'list'
You can use the applymap function, which is quite handy for this problem:
# unwrap one-element list cells so the values are hashable, then flag duplicates
df2 = df[df.applymap(lambda x: x[0] if isinstance(x, list) else x).duplicated('phone')]
print(df2)
name kind phone
2 [Carol Sway] person [8866155845]
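Note that in pandas 2.1+ DataFrame.applymap is deprecated and has been renamed to DataFrame.map; the same unwrapping trick works there unchanged:
# pandas >= 2.1: DataFrame.map is the new name for applymap
df2 = df[df.map(lambda x: x[0] if isinstance(x, list) else x).duplicated('phone')]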
You may be surprised that pd.DataFrame.duplicated works differently from pd.Series.duplicated. You are right that df.duplicated("phone") throws TypeError, but calling df.phone.duplicated() directly succeeds.
df[df.phone.duplicated()] # or df[df["phone"].duplicated()]
# name kind phone
# 2 [Carol Sway] person [8866155845]
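If you want every row that shares a duplicated number, not just the later occurrences, Series.duplicated also accepts keep=False:
df[df["phone"].duplicated(keep=False)]
# name kind phone
# 0 [John Muller] person [8866155845]
# 2 [Carol Sway] person [8866155845]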
Another simple and useful way to deal with list objects in a DataFrame is the explode method, which transforms each list-like element into its own row (but be aware that it replicates the index). You could use it as follows:
df_exploded = df.explode("phone")
df_exploded[df_exploded.duplicated("phone")]
# name kind phone
# 2 [Carol Sway] person 8866155845
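Because the original index is replicated, you can also compute the duplicate mask on the exploded column alone and map it back to the intact rows:
exploded = df["phone"].explode()  # Series with the original index repeated per element
df.loc[exploded[exploded.duplicated()].index]
# name kind phone
# 2 [Carol Sway] person [8866155845]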
Or, if you are only interested in the duplicated phone numbers themselves, you can do something like df["phone"].explode().value_counts() to see how many times each number occurs.
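For example, to list only the numbers that actually repeat:
counts = df["phone"].explode().value_counts()
counts[counts > 1]
# 8866155845    2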
You can use the hashable_df package:
from hashable_df import hashable_df
hashable_df(df).duplicated('phone')
This will make all unhashable cell values hashable, so these kinds of operations work.
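If you would rather avoid the extra dependency, you can get a similar effect in plain pandas by converting the list cells to hashable tuples yourself. A minimal sketch, assuming every non-null phone cell is a list:
# tuples are hashable; na_action='ignore' leaves the NaN cell untouched
df[df["phone"].map(tuple, na_action="ignore").duplicated()]
# name kind phone
# 2 [Carol Sway] person [8866155845]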