I have a pandas DataFrame with a categorical variable and some numeric variables. Something like this:
import pandas as pd

ls = [{'count': 5, 'module': 'payroll', 'id': 2}, {'count': 53, 'module': 'general', 'id': 2}, {'id': 5, 'count': 35, 'module': 'tax'}]
df = pd.DataFrame(ls)
The df looks like this:
df
Out[15]:
   count  id   module
0      5   2  payroll
1     53   2  general
2     35   5      tax
I want to convert (is "transpose" the right word?) the values of the module variable into columns and group by the id. So something like:
   general_count  id  payroll_count  tax_count
0           53.0   2            5.0        NaN
1            NaN   5            NaN       35.0
One approach to this would be to use apply:
df['payroll_count'] = df.id.apply(lambda x: df[df.id==x][df.module=='payroll'])
However, this suffers from multiple drawbacks:
- It is costly and takes too much time.
- It creates artifacts and empty DataFrames that need to be cleaned up.
I sense there's a better way to achieve this with pandas groupby, but I can't find a way to do this same operation more efficiently. Please help.
Pandas DataFrame transpose() function: transpose() is used to transpose index and columns. It reflects the DataFrame over its main diagonal by writing rows as columns and vice versa. If the copy parameter is True, the underlying data is copied; otherwise (the default), no copy is made if possible.
Use the T attribute or the transpose() method to swap (= transpose) the rows and columns of a DataFrame. Neither changes the original object; each returns a new object with the rows and columns swapped (= transposed object).
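As a quick illustration of the point above, here is a minimal sketch on a small frame modelled on the question's data (the variable name transposed is just for illustration). Note that transposing swaps all rows and columns, which is different from spreading the values of a single column such as module into new columns.

```python
import pandas as pd

# Small frame modelled on the question's data (values are illustrative).
df = pd.DataFrame({'count': [5, 53, 35], 'id': [2, 2, 5]})

# .T and .transpose() give the same result; both return a new object.
transposed = df.T

print(df.shape)          # (3, 2): the original is left unchanged
print(transposed.shape)  # (2, 3)

# The old column labels become the index, the old index becomes the columns.
print(transposed.index.tolist())    # ['count', 'id']
print(transposed.columns.tolist())  # [0, 1, 2]
```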
The main difference between pandas loc[] and iloc[] is that loc[] selects DataFrame rows and columns by label, while iloc[] selects them by integer position. With loc[], a label that is not present raises a KeyError. With iloc[], a position that is out of range raises an IndexError.
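A small sketch of that difference, assuming a frame indexed by module name (the index labels and the names by_label, by_position, loc_error and iloc_error are made up for this example):

```python
import pandas as pd

# Illustrative frame with string labels as the index.
df = pd.DataFrame({'count': [5, 53, 35]}, index=['payroll', 'general', 'tax'])

by_label = df.loc['general', 'count']  # label-based lookup
by_position = df.iloc[1, 0]            # position-based lookup (row 1, column 0)
print(by_label, by_position)           # 53 53

# A label that does not exist raises KeyError with loc[].
try:
    df.loc['missing']
    loc_error = None
except KeyError as exc:
    loc_error = type(exc).__name__

# A position outside the frame raises IndexError with iloc[].
try:
    df.iloc[10]
    iloc_error = None
except IndexError as exc:
    iloc_error = type(exc).__name__

print(loc_error, iloc_error)  # KeyError IndexError
```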
You can use groupby on the two columns that should become the new index (id) and the new columns (module). The groups then need to be aggregated in some way; I use mean. Next, convert the resulting one-column DataFrame to a Series with DataFrame.squeeze (so it is not necessary to remove the top level of a MultiIndex in the columns) and reshape it with unstack. Finally, use add_suffix to append '_count' to the column names:
df = df.groupby(['id','module']).mean().squeeze().unstack().add_suffix('_count')
print(df)
module  general_count  payroll_count  tax_count
id
2                53.0            5.0        NaN
5                 NaN            NaN       35.0
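To see what each link of that chain contributes, here is the same one-liner split into steps, as a sketch using the question's data. The intermediate names (grouped, series, wide, result, flat) are made up for illustration, and the final reset_index step is an optional addition that turns id back into a regular column, as in the desired output.

```python
import pandas as pd

ls = [{'count': 5, 'module': 'payroll', 'id': 2},
      {'count': 53, 'module': 'general', 'id': 2},
      {'id': 5, 'count': 35, 'module': 'tax'}]
df = pd.DataFrame(ls)

# 1. Aggregate per (id, module): a one-column DataFrame with a MultiIndex.
grouped = df.groupby(['id', 'module']).mean()

# 2. Collapse the single 'count' column into a Series.
series = grouped.squeeze()

# 3. Move the inner index level (module) into the columns.
wide = series.unstack()

# 4. Rename the columns: general -> general_count, and so on.
result = wide.add_suffix('_count')

# Optional: make id a regular column and drop the 'module' columns name.
flat = result.reset_index().rename_axis(None, axis=1)
print(flat)
```

One caveat: squeeze returns a scalar rather than a Series when the grouped frame has exactly one row, so selecting the column explicitly with grouped['count'] is a safer choice for very small inputs.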
Another solution uses pivot. It produces a MultiIndex in the columns, which then needs to be flattened with a list comprehension:
df = df.pivot(index='id', columns='module')
df.columns = ['_'.join((col[1], col[0])) for col in df.columns]
print(df)
    general_count  payroll_count  tax_count
id
2            53.0            5.0        NaN
5             NaN            NaN       35.0
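A possible variation, not part of the original answer: passing values='count' to pivot keeps the columns flat, so add_suffix can be used instead of the list comprehension. The sketch below also shows pivot_table, which aggregates instead of raising an error when an (id, module) pair occurs more than once. The names wide and wide_agg are illustrative.

```python
import pandas as pd

ls = [{'count': 5, 'module': 'payroll', 'id': 2},
      {'count': 53, 'module': 'general', 'id': 2},
      {'id': 5, 'count': 35, 'module': 'tax'}]
df = pd.DataFrame(ls)

# Selecting a single values column avoids the MultiIndex in the columns.
wide = df.pivot(index='id', columns='module', values='count').add_suffix('_count')
print(wide)

# pivot raises a ValueError on duplicate (id, module) pairs;
# pivot_table aggregates them instead (here with the mean).
wide_agg = df.pivot_table(index='id', columns='module', values='count',
                          aggfunc='mean').add_suffix('_count')
print(wide_agg)
```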