I am working with a dataframe that has multiple columns, and I wish to find the unique values of selected columns and replace them with values from another list.
So for example, this is my dataframe:
import pandas as pd
data = {'col1': ["Bruce Wayne", "Clark Kent", "Peter Parker"],
'col2': ["Alfred Pennyworth", "Bruce Wayne", "Clark Kent"]}
df = pd.DataFrame(data=data)
# col1 col2
# 0 Bruce Wayne Alfred Pennyworth
# 1 Clark Kent Bruce Wayne
# 2 Peter Parker Clark Kent
And I have the following list of values that I want to use as replacements for the unique values in my dataframe:
AlternativeNames = ["Batman", "Superman", "Spiderman", "Batman's butler"]
So the output will be:
col1 col2
0 Batman Batman's butler
1 Superman Batman
2 Spiderman Superman
You can assume the order does not matter. So if Clark Kent gets mapped to Batman, it is fine. However, the consistency of the mapping is important, so if Clark Kent gets mapped to Batman, it has to be applied everywhere.
I know how to get unique values of multiple columns, and I know about pd.factorize(); however, in this case I have a reference list, and I am not sure how to replace values according to the reference list.
You can use the pandas Categorical data type:
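# Stack all columns into one Series so they share a single set of categories
df = df.stack().astype('category')
# rename_categories swaps the labels in place of assigning to .cat.categories,
# which recent pandas versions no longer allow
df = df.cat.rename_categories(["Batman", "Superman", "Spiderman", "Batman's butler"])
# Restore the original two-column layout
df = df.unstack()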
col1 col2
0 Superman Batman
1 Spiderman Superman
2 Batman's butler Spiderman
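The pairing above is not random: for string data the inferred categories come out sorted, so the reference list is matched against the names in alphabetical order. A small sketch to inspect that pairing before applying it (the names `alt` and `mapping` are just illustrative):

```python
import pandas as pd

data = {'col1': ["Bruce Wayne", "Clark Kent", "Peter Parker"],
        'col2': ["Alfred Pennyworth", "Bruce Wayne", "Clark Kent"]}
df = pd.DataFrame(data=data)
alt = ["Batman", "Superman", "Spiderman", "Batman's butler"]

# Categories inferred from strings are the sorted unique values
categories = df.stack().astype('category').cat.categories

# Pair each original name with the replacement in the same position
mapping = dict(zip(categories, alt))
print(mapping)
```

Here "Alfred Pennyworth" sorts first, which is why it picks up "Batman".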
Alternatively, shorter but harder to read:
alt = ["Batman", "Superman", "Spiderman", "Batman's butler"]
df.replace(dict(zip(df.stack().astype('category').cat.categories, alt)))
col1 col2
0 Superman Batman
1 Spiderman Superman
2 Batman's butler Spiderman
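If you would rather pair the names in order of first appearance instead of alphabetically, one possible variant is to build the mapping from pd.unique. This is only a sketch under a few assumptions: `cols`, `alt`, `mapping` and `result` are illustrative names, the values are read column by column, and the reference list is assumed to have exactly one entry per unique value (the check below raises otherwise).

```python
import pandas as pd

data = {'col1': ["Bruce Wayne", "Clark Kent", "Peter Parker"],
        'col2': ["Alfred Pennyworth", "Bruce Wayne", "Clark Kent"]}
df = pd.DataFrame(data=data)
alt = ["Batman", "Superman", "Spiderman", "Batman's butler"]

# Columns whose values should be replaced (assumption: both columns here)
cols = ['col1', 'col2']

# Unique values in order of first appearance, reading col1 first, then col2
uniques = pd.unique(df[cols].to_numpy().ravel(order='F'))

# Guard against a reference list that is too short or too long
if len(uniques) != len(alt):
    raise ValueError(f"expected {len(uniques)} replacement values, got {len(alt)}")

mapping = dict(zip(uniques, alt))

# Apply the same mapping to every selected column so it stays consistent
result = df.copy()
result[cols] = df[cols].apply(lambda s: s.map(mapping))
print(result)
```

With the sample data this maps Bruce Wayne to Batman, Clark Kent to Superman, Peter Parker to Spiderman and Alfred Pennyworth to Batman's butler.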