
Transforming multilabels to single label problem

Tags:

python

pandas

I am working on a data manipulation exercise, where the original dataset looks like this:

df = pd.DataFrame({
'x1': [1, 2, 3, 4, 5],
'x2': [2, -7, 4, 3, 2],
'a': [0, 1, 0, 1, 1],
'b': [0, 1, 1, 0, 0],
'c': [0, 1, 1, 1, 1],
'd': [0, 0, 1, 0, 1]})

Here the columns a, b, c, d are categories, whereas x1, x2 are features. The goal is to convert this dataset into the following format:

dfnew1 = pd.DataFrame({
'x1': [1, 2,2,2, 3,3,3, 4,4, 5,5,5],
'x2': [2, -7,-7,-7, 4,4,4, 3,3, 2,2,2],
'a': [0, 1,0,0, 0,0,0, 1,0,1,0,0],
'b': [0, 0,1,0, 1,0,0,0, 0, 0,0,0],
'c': [0,0,0,1,0,1,0,0,1,0,1,0],
'd': [0,0,0,0,0,0,1,0,0,0,0,1],
'y':[0,'a','b','c','b','c','d','a','c','a','c','d']})

Can I get some help on how to do it? On my part, I was able to get it into the following form:


df.loc[:, 'a':'d']=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['label_concat']=df.loc[:, 'a':'d'].apply(lambda x: '-'.join([i for i in x if i!=0]),axis=1)

This gave me the following output:



   x1   x2  a   b   c   d   label_concat
0   1   2   0   0   0   0       
1   2   -7  a   b   c   0   a-b-c
2   3   4   0   b   c   d   b-c-d
3   4   3   a   0   c   0   a-c
4   5   2   a   0   c   d   a-c-d

As seen, it is not the desired output. Can I please get some help on how to modify my approach to get the desired output? Thanks.


asked Jul 16 '20 22:07

jay

1 Answer

You could try this, to get the desired output based on your original approach:

Option 1

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)
df=df.explode('y').fillna(0).reset_index(drop=True)
m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1).astype(int)
df.loc[1:, 'a':'d']=m.astype(int)

Another approach, similar to @ALollz's solution:

Option 2

df=df.assign(y=[np.array(range(i))+1 for i in df.loc[:, 'a':'d'].sum(axis=1)]).explode('y').fillna(1)
m = df.loc[:, 'a':'d'].cumsum(axis=1).eq(df.y, axis=0)  # row-wise running count of set flags; same result as the grouped cumsum along axis 1
df.loc[:, 'a':'d'] = df.loc[:, 'a':'d'].where(m).fillna(0).astype(int)
df['y']=df.loc[:, 'a':'d'].dot(df.columns[list(df.columns).index('a'):list(df.columns).index('d')+1]).replace('',0)

Output:

df
  x1  x2  a  b  c  d  y
0   1   2  0  0  0  0  0
1   2  -7  1  0  0  0  a
1   2  -7  0  1  0  0  b
1   2  -7  0  0  1  0  c
2   3   4  0  1  0  0  b
2   3   4  0  0  1  0  c
2   3   4  0  0  0  1  d
3   4   3  1  0  0  0  a
3   4   3  0  0  1  0  c
4   5   2  1  0  0  0  a
4   5   2  0  0  1  0  c
4   5   2  0  0  0  1  d
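The answer does not explain Option 2. One way to read it: after explode, each copy of a row carries a counter y (1, 2, 3, ...), and the row-wise cumulative sum of the flag columns first reaches k at the k-th flag that is set. The sketch below illustrates that idea on one original row repeated three times. The names flags, k, running, keep and single are illustrative and do not appear in the answer.

```python
import pandas as pd

# Three copies of the original row with a=1, b=1, c=1, d=0.
flags = pd.DataFrame({'a': [1, 1, 1], 'b': [1, 1, 1],
                      'c': [1, 1, 1], 'd': [0, 0, 0]})
# Counter saying which active label each copy should keep (1st, 2nd, 3rd).
k = pd.Series([1, 2, 3])

# Running count of set flags from left to right within each row.
running = flags.cumsum(axis=1)
# True where the running count equals the counter for that row.
keep = running.eq(k, axis=0)
# Keep the original flag where the mask is True, otherwise write 0.
single = flags.where(keep, 0)

print(single)
```

In the third copy the running count is also 3 under column d, but the original flag there is 0, so the kept value is still 0.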

Explanation of Option 1:

First, we use your approach, but instead of changing the original data we work on a copy, temp, and instead of joining the labels into a string we keep them as a list:

temp=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns))
df['y']=temp.apply(lambda x: [i for i in x if i!=0],axis=1)   #without join

df['y']
0           []
1    [a, b, c]
2    [b, c, d]
3       [a, c]
4    [a, c, d]
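The same per-row lists can also be built without replacing values in the frame. This is a minimal sketch rather than the answer's method; label_cols and labels are names assumed here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

label_cols = ['a', 'b', 'c', 'd']  # assumed: the category columns

# For each row, keep the names of the columns whose flag equals 1.
labels = [
    [col for col, flag in zip(label_cols, row) if flag == 1]
    for row in df[label_cols].to_numpy()
]
print(labels)
```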

Then we can use pd.DataFrame.explode to expand the lists into one row per label (the empty list in the first row becomes NaN), pd.DataFrame.fillna(0) to fill that NaN, and pd.DataFrame.reset_index(drop=True) to renumber the rows:

df=df.explode('y').fillna(0).reset_index(drop=True)

df
    x1  x2  a  b  c  d            y
0    1   2  0  0  0  0            0
1    2  -7  1  1  1  0            a
2    2  -7  1  1  1  0            b
3    2  -7  1  1  1  0            c
4    3   4  0  1  1  1            b
5    3   4  0  1  1  1            c
6    3   4  0  1  1  1            d
7    4   3  1  0  1  0            a
8    4   3  1  0  1  0            c
9    5   2  1  0  1  1            a
10   5   2  1  0  1  1            c
11   5   2  1  0  1  1            d
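A small sketch of what explode does on its own may help here. It uses a two-row frame named demo, made up for illustration: the row with an empty list survives as a single row with NaN, and the original index values are repeated until reset_index is called.

```python
import pandas as pd

# Hypothetical two-row frame: one empty list, one list with two labels.
demo = pd.DataFrame({'x1': [1, 2], 'y': [[], ['a', 'b']]})

exploded = demo.explode('y')
print(exploded)

# The index still refers to the original rows (0, 1, 1) until it is reset.
renumbered = exploded.reset_index(drop=True)
print(renumbered.index.tolist())
```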

Then we build a mask over df.loc[1:, 'a':'d'] that is True where a column's label equals the value in the y column, and cast the mask to int using astype(int):

m=df.loc[1:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)

m
        a      b      c      d
1    True  False  False  False
2   False   True  False  False
3   False  False   True  False
4   False   True  False  False
5   False  False   True  False
6   False  False  False   True
7    True  False  False  False
8   False  False   True  False
9    True  False  False  False
10  False  False   True  False
11  False  False  False   True



df.loc[1:, 'a':'d']=m.astype(int)

df.loc[1:, 'a':'d']
   a  b  c  d
1   1  0  0  0
2   0  1  0  0
3   0  0  1  0
4   0  1  0  0
5   0  0  1  0
6   0  0  0  1
7   1  0  0  0
8   0  0  1  0
9   1  0  0  0
10  0  0  1  0
11  0  0  0  1
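The mask can also be computed by comparing each y value directly against the column names, without a row-wise apply. This is a sketch of a swapped-in technique (NumPy broadcasting), not the answer's code; y here is a short hand-written series and names, mask and one_hot are illustrative.

```python
import numpy as np
import pandas as pd

label_cols = ['a', 'b', 'c', 'd']  # assumed: the category columns
# A few values as they appear in the y column after explode and fillna(0).
y = pd.Series([0, 'a', 'b', 'c', 'b'], dtype=object)

names = np.array(label_cols, dtype=object)
# Shape (n_rows, 1) compared with shape (1, n_labels) gives (n_rows, n_labels).
mask = y.to_numpy()[:, None] == names[None, :]
one_hot = pd.DataFrame(mask.astype(int), columns=label_cols, index=y.index)

print(one_hot)
```

Because the comparison is against column names, a row whose y is 0 matches none of them and stays all zeros.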

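Important: in the last step we exclude the first row (df.loc[1:, ...]). After fillna(0), its y value is 0 and all of its category values are 0, so every comparison in that row would be True and the row would become all 1s. For a general approach that handles such rows anywhere in the data, you could try this: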

#Replace NaN values (the empty list from original df) with ''
df=df.explode('y').fillna('').reset_index(drop=True)

#make the mask with all the rows
msk=df.loc[:, 'a':'d'].replace(1, pd.Series(df.columns, df.columns)).apply(lambda x: x==df.y.values[int(x.name)] ,axis=1)
df.loc[:, 'a':'d']=msk.astype(int)

#Then, replace the original '' (NaN values) with 0
df=df.replace('',0)
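Putting the steps together, the whole transformation could be wrapped in one function. The sketch below is an assumption-laden alternative rather than the answer's code: to_single_label, label_cols and empty_label are hypothetical names, and it one-hot encodes y with pd.get_dummies instead of building the mask with replace and apply.

```python
import pandas as pd

def to_single_label(df, label_cols, empty_label=0):
    """Hypothetical helper: one output row per (input row, active label)."""
    out = df.copy()
    # List of active label names per row (empty list when no flag is set).
    out['y'] = pd.Series(
        [[c for c, f in zip(label_cols, row) if f == 1]
         for row in out[label_cols].to_numpy()],
        index=out.index,
    )
    # One row per label; rows with an empty list get NaN in 'y'.
    out = out.explode('y').reset_index(drop=True)
    # One-hot encode 'y'; NaN rows become all zeros.
    dummies = pd.get_dummies(out['y']).reindex(columns=label_cols, fill_value=0)
    out[label_cols] = dummies.astype(int)
    # Replace the NaN placeholder with the chosen marker for "no label".
    out['y'] = out['y'].where(out['y'].notna(), empty_label)
    return out

df = pd.DataFrame({
    'x1': [1, 2, 3, 4, 5],
    'x2': [2, -7, 4, 3, 2],
    'a': [0, 1, 0, 1, 1],
    'b': [0, 1, 1, 0, 0],
    'c': [0, 1, 1, 1, 1],
    'd': [0, 0, 1, 0, 1]})

result = to_single_label(df, ['a', 'b', 'c', 'd'])
print(result)
```

Since rows without any label are carried as NaN until the last step, no row needs to be excluded from the encoding.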

answered Sep 30 '22 18:09

MrNobody33
