Quickest way to make a get_dummies-type DataFrame from a column with multiple strings

I have a column, 'col2', that holds a comma-separated list of strings in each row. My current code is too slow: there are about 2000 unique strings (the letters in the example below) and 4000 rows, so the result has 2000 columns and 4000 rows.

In [268]: df.head()
Out[268]:
    col1    col2
0   6       A,B
1   15      C,G,A
2   25      B

Is there a fast way to convert this to a get_dummies format, where each string has its own column and that column holds 1 if the row has the string in col2 and 0 otherwise?

def get_list(df):
    # Collect each distinct string in col2, in order of first appearance.
    d = []
    for row in df.col2:
        row_list = row.split(',')
        for string in row_list:
            if string not in d:
                d.append(string)
    return d

df_list = get_list(df)

def make_cols(df, lst):
    for string in lst:
        df[string] = 0
    return df

df = make_cols(df, df_list)


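for idx in range(len(df)):
    row_list = df['col2'].iloc[idx].split(',')
    for string in row_list:
        # Index row and column in one .iloc call; chained df[string].iloc[idx] can write to a copy.
        df.iloc[idx, df.columns.get_loc(string)] += 1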

Out[113]:
   col1   col2  A  B  C  G
0     6    A,B  1  1  0  0
1    15  C,G,A  1  0  1  1
2    25      B  0  1  0  0

This is my current code for it but it's too slow.

Thanks for any help!

David Feldman asked Jan 24 '15 02:01

1 Answer

You can use:

>>> df['col2'].str.get_dummies(sep=',')
   A  B  C  G
0  1  1  0  0
1  1  0  1  1
2  0  1  0  0

To join the result to the original DataFrame:

>>> pd.concat([df, df['col2'].str.get_dummies(sep=',')], axis=1)
   col1   col2  A  B  C  G
0     6    A,B  1  1  0  0
1    15  C,G,A  1  0  1  1
2    25      B  0  1  0  0
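
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the sample frame from the question; the names `dummies` and `result` are only illustrative, and `df.join` is used here as an equivalent of the `pd.concat(..., axis=1)` call above when both frames share the same index.

```python
import pandas as pd

# Sample data mirroring the frame shown in the question.
df = pd.DataFrame({"col1": [6, 15, 25], "col2": ["A,B", "C,G,A", "B"]})

# Split each row of col2 on commas and build one 0/1 indicator column per distinct string.
dummies = df["col2"].str.get_dummies(sep=",")

# Attach the indicator columns to the original frame (aligned on the index).
result = df.join(dummies)
print(result)
```

Because `str.get_dummies` does the splitting and encoding in a single vectorized call, it replaces the row-by-row Python loops from the question.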
elyase answered Oct 24 '22 10:10
