
pandas column values to row values

I have a dataset with 171 columns. When I load it into a DataFrame, it looks like this:

ANO  MNO  UJ2010  DJ2010      UF2010  DF2010      UM2010  DM2010      UA2010  DA2010      ...
1    A    113     06/01/2010  129     06/02/2010  143     06/03/2010  209     05/04/2010  ...
2    B    218     06/01/2010  211     06/02/2010  244     06/03/2010  348     05/04/2010  ...
3    C    22      06/01/2010  114     06/02/2010  100     06/03/2010  151     05/04/2010  ...

Now I want to reshape the DataFrame so that it looks like this:

    ANO MNO Time        Unit
    1   A   06/01/2010  113
    1   A   06/02/2010  129
    1   A   06/03/2010  143
    2   B   06/01/2010  218
    2   B   06/02/2010  211
    2   B   06/03/2010  244
    3   C   06/01/2010  22
    3   C   06/02/2010  114
    3   C   06/03/2010  100
....
.....

I tried pd.melt, but I don't think it fulfills my purpose. How can I do this?

asked Mar 02 '17 07:03 by pd farhad


2 Answers

Use pd.lreshape as a close alternative to pd.melt, after filtering the columns to be grouped under each new header.

pd.lreshape takes a dictionary as its groups parameter: each key becomes a new column header, and the list of column names given as its value is stacked under that single header. The result is a long-format DataFrame.

Then sort the DataFrame by the unused (identifier) columns so that the rows for each ANO/MNO pair are grouped together.

Finally, call reset_index(drop=True) to discard the intermediate index and relabel the rows with the default integer index.

import pandas as pd

# Stack all D* columns under "Time" and all U* columns under "Unit".
d = pd.lreshape(df, {"Time": df.filter(regex=r'^D').columns,
                     "Unit": df.filter(regex=r'^U').columns})

d.sort_values(['ANO', 'MNO']).reset_index(drop=True)
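
For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes a sample frame rebuilt by hand from the four column pairs shown in the question (the real data has 171 columns), and it picks the columns with a plain prefix check instead of a regex. The names time_cols, unit_cols and long_df are only illustrative.

```python
import pandas as pd

# Sample data copied from the question (only the four visible pairs).
df = pd.DataFrame({
    "ANO": [1, 2, 3],
    "MNO": ["A", "B", "C"],
    "UJ2010": [113, 218, 22],
    "DJ2010": ["06/01/2010"] * 3,
    "UF2010": [129, 211, 114],
    "DF2010": ["06/02/2010"] * 3,
    "UM2010": [143, 244, 100],
    "DM2010": ["06/03/2010"] * 3,
    "UA2010": [209, 348, 151],
    "DA2010": ["05/04/2010"] * 3,
})

# Both lists must have the same length and the same pair order.
time_cols = [c for c in df.columns if c.startswith("D")]
unit_cols = [c for c in df.columns if c.startswith("U")]

# Each dict key becomes one output column; the listed columns are stacked under it.
long_df = pd.lreshape(df, {"Time": time_cols, "Unit": unit_cols})

# Group the rows of each ANO/MNO pair together and renumber the index.
long_df = long_df.sort_values(["ANO", "MNO"]).reset_index(drop=True)
print(long_df)
```

The columns that are not listed in the dictionary (here ANO and MNO) are carried along as identifier columns and repeated once per pair.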



If the grouping columns differ in length (for example, a U column has no matching D column), first add the missing counterparts:

from itertools import groupby, chain

unused_cols = ['ANO', 'MNO']
cols = df.columns.difference(unused_cols)

# Key function: the column name without its first character (the shared suffix).
fnc = lambda x: x[1:]
pref1, pref2 = "D", "U"

# Group the columns that share the same suffix.
groups = [list(g) for _, g in groupby(sorted(cols, key=fnc), key=fnc)]

# Complete each single-column group with its missing "D" or "U" counterpart.
fill_missing = [i if len(i)==2 else i +
                [pref1 + i[0][1:] if i[0][0] == pref2 else pref2 + i[0][1:]]
                for i in groups]

# Reindex on the completed column list; newly added columns are filled with NaN.
df = df.reindex(columns=unused_cols + list(chain(*fill_missing)))

Then continue with the same pd.lreshape steps as above, this time passing dropna=False so that rows with a missing value are kept.
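
As a rough illustration of the dropna=False step, here is a small sketch. It assumes a cut-down sample frame in which UA2010 has no DA2010 partner, and it pads the missing column with a simpler suffix-based reindex than the groupby code above. The names id_cols, suffixes, padded and long_df are made up for this example.

```python
import pandas as pd

# Cut-down sample: UA2010 has no matching DA2010 column.
df = pd.DataFrame({
    "ANO": [1, 2, 3],
    "MNO": ["A", "B", "C"],
    "UJ2010": [113, 218, 22],
    "DJ2010": ["06/01/2010"] * 3,
    "UF2010": [129, 211, 114],
    "DF2010": ["06/02/2010"] * 3,
    "UA2010": [209, 348, 151],
})

id_cols = ["ANO", "MNO"]

# Unique suffixes (column name minus its first character), in order of appearance.
suffixes = list(dict.fromkeys(c[1:] for c in df.columns if c not in id_cols))
time_cols = ["D" + s for s in suffixes]
unit_cols = ["U" + s for s in suffixes]

# reindex creates any column that does not exist yet and fills it with NaN.
padded = df.reindex(columns=id_cols + time_cols + unit_cols)

# dropna=False keeps the rows whose Time (or Unit) value is missing.
long_df = pd.lreshape(padded, {"Time": time_cols, "Unit": unit_cols}, dropna=False)
long_df = long_df.sort_values(id_cols).reset_index(drop=True)
print(long_df)
```

With the default dropna=True, the three rows that come from UA2010 would be dropped because their Time value is NaN.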

answered Sep 19 '22 10:09 by Nickil Maveli


You can reshape with stack, but first create a MultiIndex in the columns using % and //.

The columns are numbered 0, 1, 2, and so on. Floor division by 2 (//) gives both columns of a Unit/Time pair the same value in the second level of the MultiIndex, while modulo 2 (%) tells the two members of each pair apart in the first level (0 for Unit, 1 for Time).

stack then moves the last level, the one created by //, into a new level of the row index. That level is not needed, so it is removed with reset_index(level=2, drop=True).

A final reset_index converts the first and second index levels (ANO and MNO) back to columns.

[[1,0]] swaps the two columns so that Time comes before Unit.

import numpy as np

df = df.set_index(['ANO','MNO'])
cols = np.arange(len(df.columns))
df.columns = [cols % 2, cols // 2]

print (df)
           0           1    0           1    0           1    0           1
           0           0    1           1    2           2    3           3
ANO MNO                                                                    
1   A    113  06/01/2010  129  06/02/2010  143  06/03/2010  209  05/04/2010
2   B    218  06/01/2010  211  06/02/2010  244  06/03/2010  348  05/04/2010
3   C     22  06/01/2010  114  06/02/2010  100  06/03/2010  151  05/04/2010

df = df.stack()[[1,0]].reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
    ANO MNO        Time  Unit
0     1   A  06/01/2010   113
1     1   A  06/02/2010   129
2     1   A  06/03/2010   143
3     1   A  05/04/2010   209
4     2   B  06/01/2010   218
5     2   B  06/02/2010   211
6     2   B  06/03/2010   244
7     2   B  05/04/2010   348
8     3   C  06/01/2010    22
9     3   C  06/02/2010   114
10    3   C  06/03/2010   100
11    3   C  05/04/2010   151
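
Put together as one runnable piece, the steps above might look like the sketch below. It assumes the sample frame from the question rebuilt by hand, and it builds the column MultiIndex explicitly with pd.MultiIndex.from_arrays rather than by assigning a list of arrays. The names wide, pos, stacked and long_df are only illustrative.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (only the four visible pairs).
df = pd.DataFrame({
    "ANO": [1, 2, 3],
    "MNO": ["A", "B", "C"],
    "UJ2010": [113, 218, 22],
    "DJ2010": ["06/01/2010"] * 3,
    "UF2010": [129, 211, 114],
    "DF2010": ["06/02/2010"] * 3,
    "UM2010": [143, 244, 100],
    "DM2010": ["06/03/2010"] * 3,
    "UA2010": [209, 348, 151],
    "DA2010": ["05/04/2010"] * 3,
})

wide = df.set_index(["ANO", "MNO"])

# Position of each remaining column: 0, 1, 2, ...
pos = np.arange(len(wide.columns))

# Level 0: 0 for Unit, 1 for Time. Level 1: the number of the Unit/Time pair.
wide.columns = pd.MultiIndex.from_arrays([pos % 2, pos // 2])

# stack moves the last column level (the pair number) into the row index.
stacked = wide.stack()

# Put Time (1) before Unit (0), drop the pair number, restore ANO/MNO as columns.
long_df = stacked[[1, 0]].reset_index(level=2, drop=True).reset_index()
long_df.columns = ["ANO", "MNO", "Time", "Unit"]
print(long_df)
```

This only works when the columns strictly alternate Unit, Time, Unit, Time, because the pairing is purely positional.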

EDIT:

# the last column (DA2010) is missing
print (df)
   ANO MNO  UJ2010      DJ2010  UF2010      DF2010  UM2010      DM2010  UA2010
0    1   A     113  06/01/2010     129  06/02/2010     143  06/03/2010     209
1    2   B     218  06/01/2010     211  06/02/2010     244  06/03/2010     348
2    3   C      22  06/01/2010     114  06/02/2010     100  06/03/2010     151

df = df.set_index(['ANO','MNO'])
# build the MultiIndex from the first character of each column name and the remaining characters
df.columns = [df.columns.str[0], df.columns.str[1:]]
print (df)
            U           D     U           D     U           D     U
        J2010       J2010 F2010       F2010 M2010       M2010 A2010
ANO MNO                                                            
1   A     113  06/01/2010   129  06/02/2010   143  06/03/2010   209
2   B     218  06/01/2010   211  06/02/2010   244  06/03/2010   348
3   C      22  06/01/2010   114  06/02/2010   100  06/03/2010   151


# stack aligns the U/D pairs and fills the missing values with NaN
df = df.stack().reset_index(level=2, drop=True).reset_index()
df.columns = ['ANO','MNO','Time','Unit']
print (df)
    ANO MNO        Time  Unit
0     1   A         NaN   209
1     1   A  06/02/2010   129
2     1   A  06/01/2010   113
3     1   A  06/03/2010   143
4     2   B         NaN   348
5     2   B  06/02/2010   211
6     2   B  06/01/2010   218
7     2   B  06/03/2010   244
8     3   C         NaN   151
9     3   C  06/02/2010   114
10    3   C  06/01/2010    22
11    3   C  06/03/2010   100
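
The column assignment above depends on stack returning the D column before the U column. As a hedged variant, the sketch below selects the two columns by label before renaming them, so it does not depend on the column or row order that stack produces (that order can differ between pandas versions). It assumes the same sample frame without DA2010; the names wide, stacked and long_df are illustrative.

```python
import pandas as pd

# Sample data from the EDIT above: DA2010 is missing.
df = pd.DataFrame({
    "ANO": [1, 2, 3],
    "MNO": ["A", "B", "C"],
    "UJ2010": [113, 218, 22],
    "DJ2010": ["06/01/2010"] * 3,
    "UF2010": [129, 211, 114],
    "DF2010": ["06/02/2010"] * 3,
    "UM2010": [143, 244, 100],
    "DM2010": ["06/03/2010"] * 3,
    "UA2010": [209, 348, 151],
})

wide = df.set_index(["ANO", "MNO"])

# Level 0: the "U"/"D" prefix. Level 1: the rest of the name, e.g. "J2010".
wide.columns = pd.MultiIndex.from_arrays(
    [wide.columns.str[0], wide.columns.str[1:]]
)

# stack moves the suffix level into the row index; the missing D value becomes NaN.
stacked = wide.stack()

# Select by label instead of by position, then rename.
long_df = (
    stacked[["D", "U"]]
    .reset_index(level=2, drop=True)
    .reset_index()
    .rename(columns={"D": "Time", "U": "Unit"})
)
print(long_df)
```

The rows for UA2010 keep their Unit value and get NaN in Time.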

answered Sep 19 '22 10:09 by jezrael