I am working on a data manipulation exercise, where the original dataset looks like;
df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, -7, 4, 3, 2],
'a': [0, 1, 0, 1, 1],
'b': [0, 1, 1, 0, 0],
'c': [0, 1, 1, 1, 1],
'd': [0, 0, 1, 0, 1]})
Here the columns a
,b
,c
are categories whereas x
,x2
are features. The goal is to convert this dataset into following format;
dfnew1 = pd.DataFrame({
'x1': [1, 2,2,2, 3,3,3, 4,4, 5,5,5],
'x2': [2, -7,-7,-7, 4,4,4, 3,3, 2,2,2],
'a': [0, 1,0,0, 0,0,0, 1,0,1,0,0],
'b': [0, 0,1,0, 1,0,0,0, 0, 0,0,0],
'c': [0,0,0,1,0,1,0,0,1,0,1,0],
'd': [0,0,0,0,0,0,1,0,0,0,0,1],
'y':[0,'a','b','c','b','c','d','a','c','a','c','d']})
Can I get some help on how to do it? On my part, I was able to get in following form;
df.loc[:, 'a':'d']=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['label_concat']=df.loc[:, 'a':'d'].apply(lambda x: '-'.join([i for i in x if i!=0]),axis=1)
This gave me the following output;
x1 x2 a b c d label_concat
0 1 2 0 0 0 0
1 2 -7 a b c 0 a-b-c
2 3 4 0 b c d b-c-d
3 4 3 a 0 c 0 a-c
4 5 2 a 0 c d a-c-d
As seen, it is not the desired output. Can I please get some help on how to modify my approach to get desired output? thanks
Therefore a lot of approaches in the literature transform the multi-label problem into multiple single-label problems, so that the existing single-label algorithms can be used. 1. OneVsRest Traditional two-class and multi-class problems can both be cast into multi-label ones by restricting each instance to have only one label.
An intuitive approach would be to transform a multi-label problem into multiple single-label problems so existing binary classifiers can be used. But scikit-learn provides library scikit-multilearn for multi-label classification.
Most traditional learning algorithms are developed for single-label classification problems. Therefore a lot of approaches in the literature transform the multi-label problem into multiple single-label problems, so that the existing single-label algorithms can be used. 1. OneVsRest
Multi-label classification of textual data is an important problem. Examples range from news articles to emails. For instance, this can be employed to find the genres that a movie belongs to, based on the summary of its plot. Fig-2: Multi-label classification to find genres based on movie posters.
You could try this, to get the desired output based on your original approach:
Option 1
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)
Another approach, similar to @ALollz's solution:
Option 2
df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].groupby(level=0).cumsum(1).eq(df.y, axis=0)
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)
Output:
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 0 0 0 a
1 2 -7 0 1 0 0 b
1 2 -7 0 0 1 0 c
2 3 4 0 1 0 0 b
2 3 4 0 0 1 0 c
2 3 4 0 0 0 1 d
3 4 3 1 0 0 0 a
3 4 3 0 0 1 0 c
4 5 2 1 0 0 0 a
4 5 2 0 0 1 0 c
4 5 2 0 0 0 1 d
Explanation of Option 1:
First, we use your approach, but instead of change the original data, use copy temp
, and also instead of joining the columns into a string, keep them as a list:
temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1) #without join
df['y']
0 []
1 [a, b, c]
2 [b, c, d]
3 [a, c]
4 [a, c, d]
Then we can use pd.DataFrame.explode
to get the lists expanded, pd.DataFrame.fillna(0)
to fill the first row, and pd.DataFrame.reset_index()
:
df=df.explode('y').fillna(0).reset_index(drop=True)
df
x1 x2 a b c d y
0 1 2 0 0 0 0 0
1 2 -7 1 1 1 0 a
2 2 -7 1 1 1 0 b
3 2 -7 1 1 1 0 c
4 3 4 0 1 1 1 b
5 3 4 0 1 1 1 c
6 3 4 0 1 1 1 d
7 4 3 1 0 1 0 a
8 4 3 1 0 1 0 c
9 5 2 1 0 1 1 a
10 5 2 1 0 1 1 c
11 5 2 1 0 1 1 d
Then we mask df.loc[1:, 'a':'d']
to see when it is equal to y
column, and then, we cast the mask to int, using astype(int)
:
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.label_concat.values[int(x.name)] ,axis=1)
m
a b c d
1 True False False False
2 False True False False
3 False False True False
4 False True False False
5 False False True False
6 False False False True
7 True False False False
8 False False True False
9 True False False False
10 False False True False
11 False False False True
df.loc[1:, 'a':'d']=m.astype(int)
df.loc[1:, 'a':'d']
a b c d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
4 0 1 0 0
5 0 0 1 0
6 0 0 0 1
7 1 0 0 0
8 0 0 1 0
9 1 0 0 0
10 0 0 1 0
11 0 0 0 1
Important: Note that in the last step we are excluding first row in this case, because it will be True all value in row in the mask, since all values are 0, for a general way you could try this:
#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)
#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.label_concat.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)
#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With