
Transforming multilabels to single label problem

Tags: python, pandas

I am working on a data manipulation exercise in which the original dataset looks like this:

import pandas as pd

df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, -7, 4, 3, 2],
'a': [0, 1, 0, 1, 1],
'b': [0, 1, 1, 0, 0],
'c': [0, 1, 1, 1, 1],
'd': [0, 0, 1, 0, 1]})

Here the columns a, b, c, d are categories, whereas x1, x2 are features. The goal is to convert this dataset into the following format:

dfnew1 = pd.DataFrame({
'x1': [1, 2,2,2, 3,3,3, 4,4, 5,5,5],
'x2': [2, -7,-7,-7, 4,4,4, 3,3, 2,2,2],
'a': [0, 1,0,0, 0,0,0, 1,0,1,0,0],
'b': [0, 0,1,0, 1,0,0,0, 0, 0,0,0],
'c': [0,0,0,1,0,1,0,0,1,0,1,0],
'd': [0,0,0,0,0,0,1,0,0,0,0,1],
'y':[0,'a','b','c','b','c','d','a','c','a','c','d']})

Can I get some help on how to do this? So far, I have been able to get it into the following form:


df.loc[:, 'a':'d']=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['label_concat']=df.loc[:, 'a':'d'].apply(lambda x: '-'.join([i for i in x if i!=0]),axis=1)

This gave me the following output:



   x1   x2  a   b   c   d   label_concat
0   1   2   0   0   0   0       
1   2   -7  a   b   c   0   a-b-c
2   3   4   0   b   c   d   b-c-d
3   4   3   a   0   c   0   a-c
4   5   2   a   0   c   d   a-c-d

As you can see, this is not the desired output. Can I please get some help on how to modify my approach to get the desired output? Thanks.

asked Jul 16 '20 22:07 by jay


1 Answer

You could try this, to get the desired output based on your original approach:

Option 1

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)

Another approach, similar to @ALollz's solution:

Option 2

import numpy as np

df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].groupby(level=0).cumsum(1).eq(df.y, axis=0) 
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)

Output:

df
  x1  x2  a  b  c  d  y
0   1   2  0  0  0  0  0
1   2  -7  1  0  0  0  a
1   2  -7  0  1  0  0  b
1   2  -7  0  0  1  0  c
2   3   4  0  1  0  0  b
2   3   4  0  0  1  0  c
2   3   4  0  0  0  1  d
3   4   3  1  0  0  0  a
3   4   3  0  0  1  0  c
4   5   2  1  0  0  0  a
4   5   2  0  0  1  0  c
4   5   2  0  0  0  1  d
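Option 2 hinges on a running count across the label columns: after the explode, y holds 1, 2, 3, ... for each copy of a row, and only the cell where the running count reaches that number (and the original value is 1) is kept. The sketch below isolates that step on a single row. The names `row`, `running` and `k` are made up for this illustration, and a plain `DataFrame.cumsum(axis=1)` stands in for the grouped call used above.

```python
import pandas as pd

# One row with labels a, b, c active (same values as row 1 of the original df)
row = pd.DataFrame({'a': [1], 'b': [1], 'c': [1], 'd': [0]})

# Running count of active labels, left to right
running = row.cumsum(axis=1)

# k plays the role of the exploded y value (1, 2, 3, ...)
k = 3

# Keep original values only where the running count equals k, zero elsewhere
kth_only = row.where(running.eq(k)).fillna(0).astype(int)

print(running.iloc[0].tolist())
print(kth_only.iloc[0].tolist())
```

Column d also has a running count of 3, but its original value is 0, so it stays 0 after the `where`.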

Explanation of Option 1:

First, we use your approach, but instead of changing the original data, we work on a copy, temp, and instead of joining the labels into a string, we keep them as a list:

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)   #without join

df['y']
0           []
1    [a, b, c]
2    [b, c, d]
3       [a, c]
4    [a, c, d]
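If the `replace` call with a Series feels opaque, the same per-row lists can probably be built by picking the column names whose value is 1. This is only a sketch of an alternative, not the approach used above; `label_cols` and `labels_per_row` are names introduced here, and it assumes the label columns hold only 0 and 1.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

label_cols = ['a', 'b', 'c', 'd']  # assumption: these are the label columns

# For each row, keep the names of the columns whose value is 1
labels_per_row = [
    [col for col, flag in zip(label_cols, row) if flag == 1]
    for row in df[label_cols].to_numpy()
]

print(labels_per_row)
```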

Then we can use pd.DataFrame.explode to expand the lists into one row per label, pd.DataFrame.fillna(0) to fill the NaN that the empty list in the first row produces, and pd.DataFrame.reset_index(drop=True) to renumber the rows:

df=df.explode('y').fillna(0).reset_index(drop=True)

df
    x1  x2  a  b  c  d            y
0    1   2  0  0  0  0            0
1    2  -7  1  1  1  0            a
2    2  -7  1  1  1  0            b
3    2  -7  1  1  1  0            c
4    3   4  0  1  1  1            b
5    3   4  0  1  1  1            c
6    3   4  0  1  1  1            d
7    4   3  1  0  1  0            a
8    4   3  1  0  1  0            c
9    5   2  1  0  1  1            a
10   5   2  1  0  1  1            c
11   5   2  1  0  1  1            d
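To see why the `fillna(0)` and `reset_index(drop=True)` calls are needed, here is a small sketch on a toy frame (`toy`, `exploded` and `filled` are illustrative names, not part of the answer). An empty list explodes into a single NaN row, and every exploded row keeps the index of the row it came from.

```python
import pandas as pd

# Toy frame: one row with no labels, one row with two labels
toy = pd.DataFrame({'x1': [1, 2], 'y': [[], ['a', 'b']]})

exploded = toy.explode('y')
# The empty list becomes one NaN row; the second row is repeated with index 1
print(exploded)

# Fill the NaN with 0 and renumber the rows 0..n-1
filled = exploded.fillna(0).reset_index(drop=True)
print(filled)
```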

Then we build a mask that marks where each cell of df.loc[1:, 'a':'d'] matches the y column, and cast the mask to int using astype(int):

m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)

m
        a      b      c      d
1    True  False  False  False
2   False   True  False  False
3   False  False   True  False
4   False   True  False  False
5   False  False   True  False
6   False  False  False   True
7    True  False  False  False
8   False  False   True  False
9    True  False  False  False
10  False  False   True  False
11  False  False  False   True



df.loc[1:, 'a':'d']=m.astype(int)

df.loc[1:, 'a':'d']
   a  b  c  d
1   1  0  0  0
2   0  1  0  0
3   0  0  1  0
4   0  1  0  0
5   0  0  1  0
6   0  0  0  1
7   1  0  0  0
8   0  0  1  0
9   1  0  0  0
10  0  0  1  0
11  0  0  0  1

Important: in the last step we exclude the first row, because all of its values are 0 and its y is also 0, so every cell in that row of the mask would be True. For a general approach, you could try this:

#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)

#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)

#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
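As a cross-check, the whole transformation can also be sketched end to end with a different technique: explode the label lists, then rebuild the one-hot columns with `pd.get_dummies`. This is not the method explained above, just an alternative under the assumption that the label columns are a to d and contain only 0 and 1; `label_cols`, `labels`, `onehot` and `out` are names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

label_cols = ['a', 'b', 'c', 'd']  # assumption: these are the label columns

# One list of active labels per original row
labels = pd.Series(
    [[c for c, flag in zip(label_cols, row) if flag == 1]
     for row in df[label_cols].to_numpy()],
    index=df.index)

# One output row per (row, label); rows without labels get y = NaN
out = df[['x1', 'x2']].assign(y=labels).explode('y').reset_index(drop=True)

# One-hot encode the single label; NaN rows become all zeros
onehot = (pd.get_dummies(out['y'])
            .reindex(columns=label_cols, fill_value=0)
            .astype(int))

# Reassemble in the requested column order, with 0 for "no label"
out = pd.concat([out[['x1', 'x2']], onehot, out['y'].fillna(0)], axis=1)
print(out)
```

Because the one-hot columns are derived from y itself, no special handling of the all-zero first row is needed here.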
answered Sep 30 '22 18:09 by MrNobody33