I have a Pandas Dataframe that looks something like this:
text = ["abcd", "efgh", "ijkl", "mnop", "qrst", "uvwx", "yz"]
labels = ["label_1, label_2",
"label_1, label_3, label_2",
"label_2, label_4",
"label_1, label_2, label_5",
"label_2, label_3",
"label_3, label_5, label_1, label_2",
"label_1, label_3"]
df = pd.DataFrame(dict(text=text, labels=labels))
df
text labels
0 abcd label_1, label_2
1 efgh label_1, label_3, label_2
2 ijkl label_2, label_4
3 mnop label_1, label_2, label_5
4 qrst label_2, label_3
5 uvwx label_3, label_5, label_1, label_2
6 yz label_1, label_3
I would like to format the dataframe into something like this:
text label_1 label_2 label_3 label_4 label_5
abcd 1.0 1.0 0.0 0.0 0.0
efgh 1.0 1.0 1.0 0.0 0.0
ijkl 0.0 1.0 0.0 1.0 0.0
mnop 1.0 1.0 0.0 0.0 1.0
qrst 0.0 1.0 1.0 0.0 0.0
uvwx 1.0 1.0 1.0 0.0 1.0
yz 1.0 0.0 1.0 0.0 0.0
How can I accomplish this?
(I know I can split the strings in the labels and convert them into lists by doing something like df.labels.str.split(",")
but not sure as to how to proceed from there.
(so basically I'd like to convert those keywords in the labels columns into its own columns and fill in 1 whenever they appear as shown in expected output)
You can use pd.Series.str.get_dummies
and combine with the text
series:
dummies = df['labels'].str.replace(' ', '').str.get_dummies(',')
res = df['text'].to_frame().join(dummies)
print(res)
text label_1 label_2 label_3 label_4 label_5
0 abcd 1 1 0 0 0
1 efgh 1 1 1 0 0
2 ijkl 0 1 0 1 0
3 mnop 1 1 0 0 1
4 qrst 0 1 1 0 0
5 uvwx 1 1 1 0 1
6 yz 1 0 1 0 0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With