I currently have the following dataframe:
import pandas as pd
df = pd.DataFrame({"ID": ['1', '2', '3', '4', '5'],
                   "col2": [['a', 'b', 'c'],
                            ['c', 'd', 'e', 'f'],
                            ['f', 'b', 'f'],
                            ['a', 'c', 'b'],
                            ['b', 'a', 'b']]})
print(df)
ID col2
0 1 [a, b, c]
1 2 [c, d, e, f]
2 3 [f, b, f]
3 4 [a, c, b]
4 5 [b, a, b]
I want to create a new dataframe with dummies for col2, like this:
ID a b c d e f
0 1 1 1 1 0 0 0
1 2 0 0 1 1 1 1
2 3 0 1 0 0 0 1
3 4 1 1 1 0 0 0
4 5 1 1 0 0 0 0
Using the following code generates different columns for each of the letters in the column list:
df2 = df.col2.str.get_dummies(sep=",")
pd.concat([df['ID'], df2], axis=1)
ID a b b] c c] d d] e f] [a [b [c [f
1 0 1 0 0 1 0 0 0 0 1 0 0 0
2 0 0 0 0 0 1 0 1 1 0 0 1 0
3 0 1 0 0 0 0 0 0 1 0 0 0 1
4 0 0 1 1 0 0 0 0 0 1 0 0 0
5 1 0 0 0 0 0 1 0 0 0 1 0 0
As shown, the result has separate columns for the same letter depending on its position in the list (for example b and b], or c, c] and [c). Does anyone know why this happens? pd.get_dummies doesn't work either.
str.get_dummies works well on strings, so you can join each list into a single delimiter-separated string and call str.get_dummies on that string. For example:
df['col2'].str.join('@').str.get_dummies('@')
Out:
a b c d e f
0 1 1 1 0 0 0
1 0 0 1 1 1 1
2 0 1 0 0 0 1
3 1 1 1 0 0 0
4 1 1 0 0 0 0
Here, @ is an arbitrary character that does not appear in the list.
Then, you can concat as usual:
pd.concat([df['ID'], df['col2'].str.join('@').str.get_dummies('@')], axis=1)
Out:
ID a b c d e f
0 1 1 1 1 0 0 0
1 2 0 0 1 1 1 1
2 3 0 1 0 0 0 1
3 4 1 1 1 0 0 0
4 5 1 1 0 0 0 0
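Putting the steps above together, a minimal self-contained sketch might look like the following. The helper name lists_to_dummies and its separator check are illustrative assumptions, not part of pandas; the check only guards against picking a separator that already occurs in the data.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ['1', '2', '3', '4', '5'],
    "col2": [['a', 'b', 'c'],
             ['c', 'd', 'e', 'f'],
             ['f', 'b', 'f'],
             ['a', 'c', 'b'],
             ['b', 'a', 'b']],
})


def lists_to_dummies(series, sep='@'):
    """Hypothetical helper: one indicator column per distinct list element."""
    # The separator must not occur inside any element, otherwise that
    # element would be split into several bogus columns.
    if any(sep in item for items in series for item in items):
        raise ValueError(f"separator {sep!r} occurs in the data")
    # Join each list into one string, then let get_dummies split it again.
    return series.str.join(sep).str.get_dummies(sep)


result = pd.concat([df['ID'], lists_to_dummies(df['col2'])], axis=1)
print(result)
```

If the elements could contain any printable character, a longer separator string may be a safer choice than a single character.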
Building each row with a dict comprehension could be faster:
In [40]: pd.DataFrame({k: 1 for k in x} for x in df.col2.values).fillna(0).astype(int)
Out[40]:
a b c d e f
0 1 1 1 0 0 0
1 0 0 1 1 1 1
2 0 1 0 0 0 1
3 1 1 1 0 0 0
4 1 1 0 0 0 0
In [48]: pd.concat([
    df['ID'],
    pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)],
    axis=1)
Out[48]:
ID a b c d e f
0 1 1 1 1 0 0 0
1 2 0 0 1 1 1 1
2 3 0 1 0 0 0 1
3 4 1 1 1 0 0 0
4 5 1 1 0 0 0 0
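To see why the fillna(0) and astype(int) steps are needed, it may help to look at the intermediate frame. This is a sketch on the same example data. The names rows, raw and dummies are illustrative, and the sort_index call is an extra step not in the original answer, added so the column order does not depend on the order in which letters first appear.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ['1', '2', '3', '4', '5'],
    "col2": [['a', 'b', 'c'],
             ['c', 'd', 'e', 'f'],
             ['f', 'b', 'f'],
             ['a', 'c', 'b'],
             ['b', 'a', 'b']],
})

# Each row becomes a dict such as {'a': 1, 'b': 1, 'c': 1}; duplicate
# letters collapse because dict keys are unique.
rows = [{k: 1 for k in x} for x in df['col2']]

# Letters missing from a row show up as NaN in the resulting frame.
raw = pd.DataFrame(rows)
print(raw)

# Replace NaN with 0, cast to integers, and order the columns alphabetically.
dummies = raw.fillna(0).astype(int).sort_index(axis=1)
print(pd.concat([df['ID'], dummies], axis=1))
```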
Timings
In [2942]: df.shape
Out[2942]: (50000, 2)
In [2945]: %timeit pd.DataFrame({k: 1 for k in x} for x in df.col2).fillna(0).astype(int)
10 loops, best of 3: 137 ms per loop
In [2946]: %timeit df['col2'].str.join('@').str.get_dummies('@')
1 loop, best of 3: 395 ms per loop
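These numbers will vary with the machine and the pandas version. A rough way to repeat the comparison outside IPython is sketched below. How the 50,000-row frame was built is not stated, so repeating the 5-row example 10,000 times is an assumption, as are the function names via_dicts and via_str_join.

```python
import timeit

import pandas as pd

small = pd.DataFrame({
    "ID": ['1', '2', '3', '4', '5'],
    "col2": [['a', 'b', 'c'],
             ['c', 'd', 'e', 'f'],
             ['f', 'b', 'f'],
             ['a', 'c', 'b'],
             ['b', 'a', 'b']],
})

# Assumption: build a (50000, 2) frame by repeating the 5-row example.
big = pd.concat([small] * 10000, ignore_index=True)


def via_dicts(frame):
    # One dict per row; missing letters become NaN and are filled with 0.
    rows = [{k: 1 for k in x} for x in frame['col2']]
    return pd.DataFrame(rows).fillna(0).astype(int)


def via_str_join(frame):
    # Join each list with '@', then split it back into indicator columns.
    return frame['col2'].str.join('@').str.get_dummies('@')


# Both approaches should agree once the columns are in the same order.
a = via_dicts(big).sort_index(axis=1)
b = via_str_join(big).sort_index(axis=1)
same = (a.to_numpy() == b.to_numpy()).all()
print("same result:", same)

for name, func in [("dict comprehension", via_dicts),
                   ("str.join + get_dummies", via_str_join)]:
    # Average wall-clock time over three calls.
    seconds = timeit.timeit(lambda: func(big), number=3) / 3
    print(f"{name}: {seconds * 1000:.0f} ms per call")
```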