Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Binary-vectorize pandas DataFrame column

In a fictional patients dataset one might encounter the following table:

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"]
})

Which renders the following dataset:

Fictional diseases

Now, assuming that the rows with multiple illnesses use the same pattern (separation with a character, in this context a &) and that there exists a complete list diseases of the illnesses, I've yet to find a simple solution to applying to these situations pandas.get_dummies one-hot encoder to obtain a binary vector for each patient.

How can I obtain, in the simplest possible manner, the following binary vectorization from the initial DataFrame?

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties":[1, 0, 1],
    "Dragon Pox":[0, 1, 0],
    "Greyscale":[0, 0, 1]
})

Desired result

like image 502
Luca Cappelletti Avatar asked May 12 '19 13:05

Luca Cappelletti


People also ask

Does pandas apply use vectorization?

Pandas VectorizationThe fastest way to work with Pandas and Numpy is to vectorize your functions. On the other hand, running functions element by element along an array or a series using for loops, list comprehension, or apply() is a bad practice.

Is DF apply faster than for loop?

apply is not faster in itself but it has advantages when used in combination with DataFrames. This depends on the content of the apply expression. If it can be executed in Cython space, apply is much faster (which is the case here).

What is Vectorisation in pandas?

Be aware of the multiple meanings of vectorization. In Pandas, it just means a batch API. Numeric code in Pandas often benefits from the second meaning of vectorization, a vastly faster native code loop. Vectorization in strings in Pandas can often be slower, since it doesn't use native code loops.

Is NP vectorize faster than apply?

From what I measured (shown below in some experiments), using np. vectorize() is 25x faster (or more) than using the DataFrame function apply() , at least on my 2016 MacBook Pro.


Video Answer


2 Answers

You can use Series.str.get_dummies with right separator,

df.set_index('Patients')['Disease'].str.get_dummies(' & ').reset_index()

    Patients    Cooties Dragon Pox  Greycale
0   Luke        1       0           0
1   Nigel       0       1           0
2   Sarah       1       0           1
like image 147
Vaishali Avatar answered Oct 17 '22 06:10

Vaishali


We can unnest your string to rows using this function.

After that we use pivot_table with aggfunc=len:

df = explode_str(df, 'Disease', ' & ')

print(df)
  Patients     Disease
0     Luke     Cooties
1    Nigel  Dragon Pox
2    Sarah    Greycale
2    Sarah     Cooties

df.pivot_table(index='Patients', columns='Disease', aggfunc=len)\
  .fillna(0).reset_index()

Disease Patients  Cooties  Dragon Pox  Greycale
0           Luke      1.0         0.0       0.0
1          Nigel      0.0         1.0       0.0
2          Sarah      1.0         0.0       1.0

Function used from linked answer:

def explode_str(df, col, sep):
    s = df[col]
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})
like image 2
Erfan Avatar answered Oct 17 '22 04:10

Erfan