I have a dataframe that includes columns with multiple attributes separated by commas:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'labels': ["a,b,c", "c,a", "d,a,b"]})
   id labels
0   1  a,b,c
1   2    c,a
2   3  d,a,b
(I know this isn't an ideal situation, but the data originates from an external source.) I want to turn the multi-attribute columns into multiple columns, one for each label, so that I can treat them as categorical variables. Desired output:
   id     a      b      c      d
0   1  True   True   True  False
1   2  True  False   True  False
2   3  True   True  False   True
I can get the set of all possible attributes ([a,b,c,d]) fairly easily, but cannot figure out a way to determine whether a given row has a particular attribute without row-by-row iteration for each attribute. Is there a better way to do this?
You can use str.get_dummies to build one indicator column per label, cast the resulting 1 and 0 values to boolean with astype, and finally concat the id column back on:
print(df['labels'].str.get_dummies(sep=',').astype(bool))
      a      b      c      d
0  True   True   True  False
1  True  False   True  False
2  True   True  False   True
print(pd.concat([df['id'], df['labels'].str.get_dummies(sep=',').astype(bool)], axis=1))
   id     a      b      c      d
0   1  True   True   True  False
1   2  True  False   True  False
2   3  True   True  False   True
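For reference, here is the same approach as a self-contained Python 3 sketch. It assumes a reasonably recent pandas (drop(columns=...) needs 0.21 or later), and the names flags and result are only illustrative. It uses join instead of concat, which is a small variation rather than part of the answer above: that way any other columns in df are kept without listing them by hand.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'labels': ["a,b,c", "c,a", "d,a,b"]})

# One 0/1 indicator column per distinct label, then cast to bool.
flags = df['labels'].str.get_dummies(sep=',').astype(bool)

# Drop the original comma-separated column and attach the flags.
# join aligns on the index, which both frames share here.
result = df.drop(columns='labels').join(flags)
print(result)
```

If the labels may contain stray spaces (for example "a, b"), they would likely need to be normalised first, since get_dummies treats " b" and "b" as different labels.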