Binary-vectorize pandas DataFrame column

Tags: python, pandas, dataframe

In a fictional patients dataset one might encounter the following table:

import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"]
})

This renders as the following dataset:

[Image: Fictional diseases]

Assume that rows with multiple illnesses always follow the same pattern (illnesses separated by a character, here &) and that a complete list of the illnesses, diseases, is available. I have yet to find a simple way to apply the pandas.get_dummies one-hot encoder in this situation to obtain a binary vector for each patient.

How can I obtain, in the simplest possible manner, the following binary vectorization from the initial DataFrame?

pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Cooties": [1, 0, 1],
    "Dragon Pox": [0, 1, 0],
    "Greycale": [0, 0, 1]
})

[Image: Desired result]

asked May 12 '19 13:05

Luca Cappelletti

2 Answers

You can use Series.str.get_dummies with the right separator:

df.set_index('Patients')['Disease'].str.get_dummies(' & ').reset_index()

    Patients    Cooties Dragon Pox  Greycale
0   Luke        1       0           0
1   Nigel       0       1           0
2   Sarah       1       0           1
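
If the complete list of illnesses mentioned in the question is available, the encoded frame can also be reindexed against it so that illnesses no patient has still get a column of zeros. The following is a minimal sketch under that assumption; the contents of diseases, including the extra entry "Spattergroit", are made up for illustration and are not part of the question's data.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Assumed complete list of illnesses. "Spattergroit" is a hypothetical
# extra entry that no patient in the sample data has.
diseases = ["Cooties", "Dragon Pox", "Greycale", "Spattergroit"]

encoded = (
    df.set_index("Patients")["Disease"]
      .str.get_dummies(sep=" & ")                # one 0/1 column per illness seen
      .reindex(columns=diseases, fill_value=0)   # add all-zero columns for unseen ones
      .reset_index()
)
print(encoded)
```

Reindexing also fixes the column order to that of diseases, which can help when several datasets need to share one layout.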

answered Oct 17 '22 06:10

Vaishali

We can unnest the strings into rows using the explode_str function shown at the end of this answer.

After that we use pivot_table with aggfunc=len:

df = explode_str(df, 'Disease', ' & ')

print(df)
  Patients     Disease
0     Luke     Cooties
1    Nigel  Dragon Pox
2    Sarah    Greycale
2    Sarah     Cooties

df.pivot_table(index='Patients', columns='Disease', aggfunc=len)\
  .fillna(0).reset_index()

Disease Patients  Cooties  Dragon Pox  Greycale
0           Luke      1.0         0.0       0.0
1          Nigel      0.0         1.0       0.0
2          Sarah      1.0         0.0       1.0

Function used from linked answer:

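import numpy as np

def explode_str(df, col, sep):
    s = df[col]
    # Repeat each row position once per value in its separated string
    i = np.arange(len(s)).repeat(s.str.count(sep) + 1)
    # Join all strings and re-split them so each value gets its own row
    return df.iloc[i].assign(**{col: sep.join(s).split(sep)})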
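
On pandas 0.25 or later, the built-in Series.str.split and DataFrame.explode can probably stand in for the helper above, and pd.crosstab can do the counting. This is a sketch of an alternative rather than the method used in this answer; the variable names exploded, counts and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Patients": ["Luke", "Nigel", "Sarah"],
    "Disease": ["Cooties", "Dragon Pox", "Greycale & Cooties"],
})

# Split each string into a list, then give every list element its own row.
exploded = df.assign(Disease=df["Disease"].str.split(" & ")).explode("Disease")

# Count patient/illness pairs; crosstab returns integer counts (0 or 1 here).
counts = pd.crosstab(exploded["Patients"], exploded["Disease"])

result = counts.reset_index()
result.columns.name = None  # drop the leftover "Disease" label on the columns
print(result)
```

Unlike the pivot_table call, crosstab yields integer columns directly, so no fillna step is needed.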

answered Oct 17 '22 04:10

Erfan
