I have a pandas DataFrame with a column that contains lists of variables, and I want to convert those variables to dummy variables. Basically I want to convert:

to this:

df = pd.DataFrame({0: [['hello', 'motto'], ['motto', 'mania']]})
print(df)
                0
0  [hello, motto]
1  [motto, mania]
Use str.join to combine each list into a single '|'-separated string, then str.get_dummies, which splits on '|' by default:
df[0].str.join('|').str.get_dummies()
   hello  mania  motto
0      1      0      1
1      0      1      1
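If the list items might themselves contain the '|' character, joining on it would split them incorrectly. A minimal sketch of one alternative, assuming pandas 0.25 or later (for Series.explode); the names items and dummies are only illustrative:

```python
import pandas as pd

# Same sample frame as above: one column (labelled 0) holding lists of strings.
df = pd.DataFrame({0: [['hello', 'motto'], ['motto', 'mania']]})

# explode() gives each list item its own row while repeating the original
# row label, so the index still records which row every item came from.
items = df[0].explode()

# One-hot encode the items, then collapse back to one row per original
# index label; max() leaves a 1 wherever the item occurred in that row.
dummies = pd.get_dummies(items, dtype=int).groupby(level=0).max()

print(dummies)
```

For this sample frame the result should match the table above. The items are never joined into a string, so no separator has to be chosen.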
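Here is a memory-saving solution that uses a sparse matrix from scikit-learn and sparse pandas columns. It relies on pd.arrays.SparseArray (pd.SparseSeries was removed in pandas 1.0) and on get_feature_names_out() (get_feature_names() was removed in scikit-learn 1.2). Only the non-zero entries are stored, so memory usage scales with the number of ones; the exact byte counts depend on the pandas version. Note that CountVectorizer lowercases tokens and drops single-character tokens by default:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()
# join each list into one space-separated string so it can be tokenized
X = vect.fit_transform(df.pop(0).str.join(' '))
# add one sparse column per vocabulary entry
for i, col in enumerate(vect.get_feature_names_out()):
    df[col] = pd.arrays.SparseArray(X[:, i].toarray().ravel(), fill_value=0)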
Result:
In [81]: df
Out[81]:
   hello  mania  motto
0      1      0      1
1      0      1      1
In [82]: df.memory_usage()
Out[82]:
Index    80
hello     8   # notice memory usage: # of ones multiplied by 8 bytes (int64)
mania     8
motto    16
dtype: int64
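The column-by-column loop can also be avoided. The following is a sketch of a different technique, not the method above: it assumes scikit-learn's MultiLabelBinarizer with sparse_output=True and pandas 0.25 or later for DataFrame.sparse.from_spmatrix. The name sparse_dummies is illustrative.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Same sample frame as above.
df = pd.DataFrame({0: [['hello', 'motto'], ['motto', 'mania']]})

# sparse_output=True makes fit_transform return a SciPy sparse matrix
# instead of a dense NumPy array.
mlb = MultiLabelBinarizer(sparse_output=True)
X = mlb.fit_transform(df[0])

# Wrap the sparse matrix in a DataFrame without densifying it;
# every resulting column has a Sparse dtype.
sparse_dummies = pd.DataFrame.sparse.from_spmatrix(
    X, index=df.index, columns=mlb.classes_
)

print(sparse_dummies)
print(sparse_dummies.dtypes)
```

Unlike CountVectorizer, MultiLabelBinarizer does not tokenize or lowercase anything, so each list item is kept verbatim as a column name.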