
How do you One Hot Encode columns with a list of strings as values?

I'm basically trying to one hot encode a column with values like this:

  tickers
1 [DIS]
2 [AAPL,AMZN,BABA,BAY]
3 [MCDO,PEP]
4 [ABT,ADBE,AMGN,CVS]
5 [ABT,CVS,DIS,ECL,EMR,FAST,GE,GOOGL]
...

First I collected the set of all tickers (about 467 of them):

all_tickers = list()
for tickers in df.tickers:
    for ticker in tickers:
        all_tickers.append(ticker)
all_tickers = set(all_tickers)

Then I implemented One Hot Encoding this way:

for i in range(len(df.index)):
    for ticker in all_tickers:
        if ticker in df.iloc[i]['tickers']:
            df.at[i+1, ticker] = 1
        else:
            df.at[i+1, ticker] = 0

The problem is that the script runs incredibly slowly when processing 5000+ rows. How can I improve my algorithm?

Castle asked Dec 13 '17 06:12




2 Answers

I think you need str.join with str.get_dummies:

df = df['tickers'].str.join('|').str.get_dummies()

Or:

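import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()

# fit_transform returns a 0/1 array with one column per distinct ticker (mlb.classes_)
df = pd.DataFrame(mlb.fit_transform(df['tickers']), columns=mlb.classes_, index=df.index)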
print (df)
   AAPL  ABT  ADBE  AMGN  AMZN  BABA  BAY  CVS  DIS  ECL  EMR  FAST  GE  \
1     0    0     0     0     0     0    0    0    1    0    0     0   0   
2     1    0     0     0     1     1    1    0    0    0    0     0   0   
3     0    0     0     0     0     0    0    0    0    0    0     0   0   
4     0    1     1     1     0     0    0    1    0    0    0     0   0   
5     0    1     0     0     0     0    0    1    1    1    1     1   1   

   GOOGL  MCDO  PEP  
1      0     0    0  
2      0     0    0  
3      0     1    1  
4      0     0    0  
5      1     0    0  
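
If you want to keep the original tickers column next to the indicator columns instead of overwriting df, one option is to join the result back onto the frame. A minimal sketch, assuming that no ticker contains the '|' separator and using the sample rows from the question (the names dummies and encoded are only illustrative):

```python
import pandas as pd

# Sample rows copied from the question; the 1-based index mirrors its printout.
df = pd.DataFrame(
    {"tickers": [["DIS"],
                 ["AAPL", "AMZN", "BABA", "BAY"],
                 ["MCDO", "PEP"],
                 ["ABT", "ADBE", "AMGN", "CVS"],
                 ["ABT", "CVS", "DIS", "ECL", "EMR", "FAST", "GE", "GOOGL"]]},
    index=range(1, 6),
)

# Join each list into one '|'-separated string, then split it into 0/1 columns.
dummies = df["tickers"].str.join("|").str.get_dummies()

# Attach the indicator columns to the original frame (aligned on the index).
encoded = df.join(dummies)
print(encoded.head())
```

This is a single vectorised pass over the column, so it avoids the per-cell df.at assignments from the question.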
jezrael answered Sep 22 '22 20:09


You can use apply(pd.Series) and then get_dummies():

import pandas as pd

df = pd.DataFrame({"tickers":[["DIS"], ["AAPL","AMZN","BABA","BAY"],
                              ["MCDO","PEP"], ["ABT","ADBE","AMGN","CVS"],
                              ["ABT","CVS","DIS","ECL","EMR","FAST","GE","GOOGL"]]})

# apply(pd.Series) spreads each list across positional columns (0, 1, 2, ...)
pd.get_dummies(df.tickers.apply(pd.Series), prefix="", prefix_sep="")

   AAPL  ABT  DIS  MCDO  ADBE  AMZN  CVS  PEP  AMGN  BABA  DIS  BAY  CVS  ECL  \
0     0    0    1     0     0     0    0    0     0     0    0    0    0    0   
1     1    0    0     0     0     1    0    0     0     1    0    1    0    0   
2     0    0    0     1     0     0    0    1     0     0    0    0    0    0   
3     0    1    0     0     1     0    0    0     1     0    0    0    1    0   
4     0    1    0     0     0     0    1    0     0     0    1    0    0    1   

   EMR  FAST  GE  GOOGL  
0    0     0   0      0  
1    0     0   0      0  
2    0     0   0      0  
3    0     0   0      0  
4    1     1   1      1  
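
Note that DIS and CVS each appear twice in the output above: get_dummies encodes every positional column separately, so a ticker that occurs at different list positions gets one column per position. If you need a single column per ticker, one possible fix is to merge the columns that share a name. A rough sketch on the same sample data (the names dummies and merged are illustrative, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({"tickers": [["DIS"],
                               ["AAPL", "AMZN", "BABA", "BAY"],
                               ["MCDO", "PEP"],
                               ["ABT", "ADBE", "AMGN", "CVS"],
                               ["ABT", "CVS", "DIS", "ECL", "EMR", "FAST", "GE", "GOOGL"]]})

# One dummy column per (list position, ticker) pair, so column names can repeat.
dummies = pd.get_dummies(df["tickers"].apply(pd.Series), prefix="", prefix_sep="")

# Merge columns that share a name: transpose, group by label, take the max,
# and transpose back. Casting to int first gives plain 0/1 values.
merged = dummies.astype(int).T.groupby(level=0).max().T
print(merged)
```

The transpose-and-group step is used here rather than grouping along the column axis because it does not depend on the axis argument of groupby.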
andrew_reece answered Sep 22 '22 20:09