In a fictional patients dataset one might encounter the following table:
pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"]
})
Which renders the following dataset:
  Patients             Disease
0     Luke             Cooties
1    Nigel          Dragon Pox
2    Sarah  Greycale & Cooties
Now, assuming that the rows with multiple illnesses use the same pattern (separation with a character, in this context a &) and that there exists a complete list diseases of the illnesses, I've yet to find a simple way to apply the pandas.get_dummies one-hot encoder to these situations and obtain a binary vector for each patient.
How can I obtain, in the simplest possible manner, the following binary vectorization from the initial DataFrame?
pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties": [1, 0, 1],
    "Dragon Pox": [0, 1, 0],
    "Greycale": [0, 0, 1]
})
You can use Series.str.get_dummies with the right separator:
df.set_index('Patients')['Disease'].str.get_dummies(' & ').reset_index()
  Patients  Cooties  Dragon Pox  Greycale
0     Luke        1           0         0
1    Nigel        0           1         0
2    Sarah        1           0         1
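The question mentions a complete list diseases of the illnesses. As a minimal sketch, assuming that list is a plain Python list (the extra entry "Spattergroit" is hypothetical and only there for illustration), the result of Series.str.get_dummies can be reindexed so that every known illness gets a column, even one that no patient currently has:

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Assumed complete list of illnesses; "Spattergroit" is a made-up extra entry
diseases = ["Cooties", "Dragon Pox", "Greycale", "Spattergroit"]

encoded = (
    df.set_index("Patients")["Disease"]
      .str.get_dummies(sep=" & ")               # one indicator column per illness found
      .reindex(columns=diseases, fill_value=0)  # add missing illnesses as all-zero columns
      .reset_index()
)
print(encoded)
```

Reindexing also fixes the column order to that of the list, which may help if the vectors are later fed to a model that expects a stable layout.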
We can unnest your string to rows using the explode_str function (given at the end of this answer). After that we use pivot_table with aggfunc=len:
df = explode_str(df, 'Disease', ' & ')
print(df)
  Patients     Disease
0     Luke     Cooties
1    Nigel  Dragon Pox
2    Sarah    Greycale
2    Sarah     Cooties
df.pivot_table(index='Patients', columns='Disease', aggfunc=len)\
  .fillna(0).reset_index()
Disease Patients  Cooties  Dragon Pox  Greycale
0           Luke      1.0         0.0       0.0
1          Nigel      0.0         1.0       0.0
2          Sarah      1.0         0.0       1.0
Function used from linked answer:
import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per separated value in that row
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
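On pandas 0.25 or later, the same unnesting can probably be done without a helper, using the built-in Series.str.split and DataFrame.explode. pd.crosstab then counts the patient/disease pairs and keeps integer values instead of floats. This is a sketch of an alternative, not the method from the linked answer; the names long_df and wide are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Turn each cell into a list of illnesses, then give every list element its own row
long_df = df.assign(Disease=df["Disease"].str.split(" & ")).explode("Disease")

# Count each (patient, disease) pair; absent pairs are 0 and the dtype stays integer
wide = pd.crosstab(long_df["Patients"], long_df["Disease"]).reset_index()
wide.columns.name = None  # drop the leftover "Disease" label on the column axis
print(wide)
```

Note that pd.crosstab sorts patients and diseases alphabetically, so the row order follows the names rather than the original DataFrame.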