Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Transform pairs of columns to cells

Tags:

python

pandas

I have a pandas data frame like this:

ts | thing_0 | qty_0 | thing_1 | qty_1 | thing_2 | qty_2
--------------------------------------------------------
1  | dog     |     5 | cat     |     3 | mouse   |     1
2  | house   |     6 | dog     |     4 | cat     |     2
...

I want to transform this in a way that the things become columns and the qtys become the cell values. Like this:

ts | dog | cat | mouse | house
------------------------------
1  |   5 |   3 |     1 |     0
2  |   4 |   2 |     0 |     6
...

Currently, I'm doing this transformation manually by iterating over the df.values array manually, but this is very slow. Is there a faster way to implement this with pandas means?

I have seen df.pivot, but couldn't find a way to describe the relationship between thing_0 and qty_0.

like image 688
knub Avatar asked Mar 08 '23 16:03

knub


2 Answers

You could use lreshape to coalesce the thing and qty columns:

In [10]: pd.lreshape(df, {'thing':['thing_0','thing_1','thing_2',], 'qty':['qty_0','qty_1','qty_2']})
Out[10]: 
   ts  thing  qty
0   1    dog    5
1   2  house    6
2   1    cat    3
3   2    dog    4
4   1  mouse    1
5   2    cat    2

Then pivot to create the desired DataFrame:

import pandas as pd

df = pd.DataFrame({'qty_0': [5, 6], 'qty_1': [3, 4], 'qty_2': [1, 2], 'thing_0': ['dog', 'house'], 'thing_1': ['cat', 'dog'], 'thing_2': ['mouse', 'cat'], 'ts': [1, 2]})

reshaped = pd.lreshape(df, {'thing':['thing_0','thing_1','thing_2',], 
                            'qty':['qty_0','qty_1','qty_2']})

result = reshaped.pivot(index='ts', columns='thing', values='qty')
print(result)

yields

thing  cat  dog  house  mouse
ts                           
1      3.0  5.0    0.0    1.0
2      2.0  4.0    6.0    0.0

I think jezrael's solution is better since it takes advantage of the regularity of the column names you wish to coalesce. df.columns.str.split('_', expand=True) is more general and less repetitive than

{'thing':['thing_0','thing_1','thing_2',], 
 'qty':['qty_0','qty_1','qty_2']}

lreshape might be helpful in situations where the column names you wish to coalesce are irregular.

like image 140
unutbu Avatar answered Mar 20 '23 09:03

unutbu


You can create MultiIndex from columns by str.split and then reshape by stack and unstack:

df = df.set_index('ts')
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().reset_index(level=1, drop=True)
       .set_index('thing', append=True)['qty'].unstack(fill_value=0)
print (df)
thing  cat  dog  house  mouse
ts                           
1        3    5      0      1
2        2    4      6      0

Another solution with pivot instead unstack:

df = df.set_index('ts')
df.columns = df.columns.str.split('_', expand=True)
df = df.stack().reset_index()
df = df.pivot(index='ts', columns='thing', values='qty').fillna(0).astype(int)
print (df)
thing  cat  dog  house  mouse
ts                           
1        3    5      0      1
2        2    4      6      0

And third solution - dynamically create dict for lreshape and then use unstack:

Notice - columns have to be sorted, if not add df = df.sort_index(axis=1)

t = [x for x in df.columns if x.startswith('thing')]
q = [x for x in df.columns if x.startswith('qty')]
df = pd.lreshape(df, {'thing':t, 'qty':q})
       .set_index(['ts','thing'])['qty'].unstack(fill_value=0)

print (df)
thing  cat  dog  house  mouse
ts                           
1        3    5      0      1
2        2    4      6      0

EDIT:

lreshape is now undocumented, but is possible in future will by removed (with pd.wide_to_long too).

Possible solution is merging all 3 functions to one - maybe melt, but now it is not implementated. Maybe in some new version of pandas. Then my answer will be updated.

like image 44
jezrael Avatar answered Mar 20 '23 07:03

jezrael