Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

convert rows to columns in pandas dataframe

I'd like to convert to below Df1 to Df2.

The empty values would be filled with Nan.

Below Dfs are examples.

  • My data has weeks from 1 to 8.
  • IDs are 100,000.

  • Only week 8 has all IDs, so total rows will be 100,000.

I have Df3 which has 100,000 of id, and I want to merge df1 on Df3 formatted as df2.

ex) pd.merge(df3, df1, on="id", how="left") -> but, formatted as df2

 Df1>
 wk, id, col1, col2  ...
 1    1   0.5  15  
 2    2   0.5  15  
 3    3   0.5  15  
 1    2   0.5  15  
 3    2   0.5  15  

 ------
 Df2>
 wk1, id, col1, col2, wk2, id, col1, col2, wk3,  id, col1, col2,...
 1    1   0.5  15      2    1   Nan   Nan   3    1   Nan   Nan
 1    2   0.5  15      2    2   0.5  15     3    2   0.5    15
 1    3   Nan  Nan     2    3   Nan   Nan   3    3   0.5    15
like image 526
Ken Kim Avatar asked Aug 01 '18 06:08

Ken Kim


2 Answers

Use:

#create dictionary for rename columns for correct sorting
d = dict(enumerate(df.columns))
d1 = {v:k for k, v in d.items()}

#first add missing values for each `wk` and `id`
df1 = df.set_index(['wk', 'id']).unstack().stack(dropna=False).reset_index()

#for each id create DataFrame, reshape by unstask and rename columns
df1 = (df1.groupby('id')
       .apply(lambda x: pd.DataFrame(x.values, columns=df.columns))
       .unstack()
       .reset_index(drop=True)
       .rename(columns=d1, level=0)
       .sort_index(axis=1, level=1)
       .rename(columns=d, level=0))

#convert values to integers if necessary
df1.loc[:, ['wk', 'id']] = df1.loc[:, ['wk', 'id']].astype(int)

#flatten MultiIndex in columns
df1.columns = ['{}_{}'.format(a, b) for a, b in df1.columns]
print (df1)

   wk_0  id_0  col1_0  col2_0  wk_1  id_1  col1_1  col2_1  wk_2  id_2  col1_2  \
0     1     1     0.5    15.0     2     1     NaN     NaN     3     1     NaN   
1     1     2     0.5    15.0     2     2     0.5    15.0     3     2     0.5   
2     1     3     NaN     NaN     2     3     NaN     NaN     3     3     0.5   

   col2_2  
0     NaN  
1    15.0  
2    15.0  
like image 83
jezrael Avatar answered Oct 22 '22 15:10

jezrael


You can use GroupBy + concat. The idea is to create a list of dataframes with appropriately named columns and appropriate index. The concatenate along axis=1:

d = {k: v.reset_index(drop=True) for k, v in df.groupby('wk')}

def formatter(df, key):
    return df.rename(columns={'w': f'wk{key}'}).set_index('id')

L = [formatter(df, key) for key, df in d.items()]
res = pd.concat(L, axis=1).reset_index()

print(res)

   id   wk  col1  col2   wk  col1  col2   wk  col1  col2
0   1  1.0   0.5  15.0  NaN   NaN   NaN  NaN   NaN   NaN
1   2  1.0   0.5  15.0  2.0   0.5  15.0  3.0   0.5  15.0
2   3  NaN   NaN   NaN  NaN   NaN   NaN  3.0   0.5  15.0

Note NaN forces your series to become float. There's no "good" fix for this.

like image 2
jpp Avatar answered Oct 22 '22 15:10

jpp