I have a question about splitting a list in a DataFrame column into multiple columns, where each value from the list must be placed in a specific column.
Let's say I have this DataFrame:
date data
2020-01-01 00:00:00 [G07, G08, G10, G16]
2020-01-01 00:00:01 [G07, G08, G16]
2020-01-01 00:00:02 [G08, G10, G16, G20, G21]
2020-01-01 00:00:03 [G16, G20, G21, G26, G27, R02]
2020-01-01 00:00:04 [G07, G08, G26, G27]
And I'm looking for this kind of result:
date G07 G08 G10 G16 G20 G21 G26 G27 R02
2020-01-01 00:00:00 G07 G08 G10 G16 NaN NaN NaN NaN NaN
2020-01-01 00:00:01 G07 G08 NaN G16 NaN NaN NaN NaN NaN
2020-01-01 00:00:02 NaN G08 G10 G16 G20 G21 NaN NaN NaN
2020-01-01 00:00:03 NaN NaN NaN G16 G20 G21 G26 G27 R02
2020-01-01 00:00:04 G07 G08 NaN NaN NaN NaN G26 G27 NaN
The end goal is this kind of matrix:
date G07 G08 G10 G16 G20 G21 G26 G27 R02
2020-01-01 00:00:00 1 1 1 1 0 0 0 0 0
2020-01-01 00:00:01 1 1 0 1 0 0 0 0 0
2020-01-01 00:00:02 0 1 1 1 1 1 0 0 0
2020-01-01 00:00:03 0 0 0 1 1 1 1 1 1
2020-01-01 00:00:04 1 1 0 0 0 0 1 1 0
With a command like this:
In [1] pd.DataFrame(self.df['data'].to_list())
Out [1] date 1 2 3 4 5 6
2020-01-01 00:00:00 G07 G08 G10 G16
2020-01-01 00:00:01 G07 G08 G16
2020-01-01 00:00:02 G08 G10 G16 G20 G21
2020-01-01 00:00:03 G16 G20 G21 G26 G27 R02
2020-01-01 00:00:04 G07 G08 G26 G27
I'm only able to split the list into columns by position, and I cannot find a way to place each value into its own specific column.
I've thought about looping over every value for every date, but that is very slow and I have datasets with more than 1,000,000 rows.
Check with MultiLabelBinarizer from sklearn:
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Build one 0/1 indicator column per distinct label, aligned on df's index
mlb = MultiLabelBinarizer()
s = pd.DataFrame(mlb.fit_transform(df['data']), columns=mlb.classes_, index=df.index)
df = df.join(s)
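For reference, here is a self-contained sketch of the same approach run on the sample data from the question. The variable names `binary`, `labels` and `result` are just illustrative, and the dates are typed in by hand to match the example. It also shows one possible way to get the intermediate label/NaN table from the 0/1 matrix, in case that layout is needed too.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2020-01-01 00:00:00",
                "2020-01-01 00:00:01",
                "2020-01-01 00:00:02",
                "2020-01-01 00:00:03",
                "2020-01-01 00:00:04",
            ]
        ),
        "data": [
            ["G07", "G08", "G10", "G16"],
            ["G07", "G08", "G16"],
            ["G08", "G10", "G16", "G20", "G21"],
            ["G16", "G20", "G21", "G26", "G27", "R02"],
            ["G07", "G08", "G26", "G27"],
        ],
    }
)

# 0/1 indicator matrix: one column per distinct label (sorted).
mlb = MultiLabelBinarizer()
binary = pd.DataFrame(
    mlb.fit_transform(df["data"]),
    columns=mlb.classes_,
    index=df.index,
)

# Intermediate layout: the label itself where present, NaN elsewhere.
# The column names are broadcast across the rows of the 0/1 matrix.
column_names = binary.columns.to_numpy().astype(object)
labels = pd.DataFrame(
    np.where(binary.to_numpy() == 1, column_names, np.nan),
    columns=binary.columns,
    index=binary.index,
)

# Attach the indicator columns to the dates.
result = df[["date"]].join(binary)
print(result)
```

Both steps are vectorized, so there is no Python-level loop over rows. I have not timed this on a dataset with 1,000,000 rows, so treat it as a starting point rather than a benchmark.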