How to convert pandas dataframe rows into columns, based on category?

Tags:

python

pandas

I have a pandas DataFrame with one categorical variable and some numeric variables. Something like this:

import pandas as pd

ls = [{'count': 5, 'module': 'payroll', 'id': 2}, {'count': 53, 'module': 'general', 'id': 2}, {'id': 5, 'count': 35, 'module': 'tax'}]
df = pd.DataFrame(ls)

The df looks like this:

 df
Out[15]: 
   count  id   module
0      5   2  payroll
1     53   2  general
2     35   5      tax

I want to convert (is "transpose" the right word?) the module values into columns and group by the id. So something like:

   general_count  id  payroll_count  tax_count
0           53.0   2            5.0        NaN
1            NaN   5            NaN       35.0

One approach to this would be to use apply:

df['payroll_count'] = df.id.apply(lambda x: df[df.id==x][df.module=='payroll'])

However, this suffers from multiple drawbacks:

  1. Costly, and takes too much time

  2. Creates artifacts and empty dataframes that need to be cleaned up.

I sense there's a better way to achieve this with pandas groupby, but I can't find a way to do this same operation more efficiently. Please help.

Software Mechanic asked Sep 22 '16 10:09



1 Answer

You can use groupby on the columns that should become the new index and the new columns (here id and module). The groups then need to be aggregated in some way; I use mean. Next, convert the one-column DataFrame to a Series with DataFrame.squeeze (so it is not necessary to remove the top level of a MultiIndex in the columns) and reshape with unstack. Finally, add_suffix appends _count to the column names:

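# Group by the future index (id) and future columns (module), then reshape.
# The result gets its own name so the original df stays available.
df1 = df.groupby(['id','module']).mean().squeeze().unstack().add_suffix('_count')
print(df1)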
module  general_count  payroll_count  tax_count
id                                             
2                53.0            5.0        NaN
5                 NaN            NaN       35.0
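
To see what each step of the chain contributes, the intermediate objects can be inspected one at a time. This is only a sketch that assumes the sample data from the question; the names sample, grouped, series and wide are illustrative and not part of the original answer:

```python
import pandas as pd

# Sample data from the question.
sample = pd.DataFrame([
    {'count': 5, 'module': 'payroll', 'id': 2},
    {'count': 53, 'module': 'general', 'id': 2},
    {'count': 35, 'module': 'tax', 'id': 5},
])

# Step 1: one row per (id, module) pair, still a one-column DataFrame.
grouped = sample.groupby(['id', 'module']).mean()

# Step 2: squeeze the single 'count' column into a Series with a MultiIndex.
series = grouped.squeeze()

# Step 3: move the inner index level (module) into the columns.
wide = series.unstack()

# Step 4: rename the columns, e.g. 'payroll' -> 'payroll_count'.
wide = wide.add_suffix('_count')
print(wide)
```

One caveat: squeeze returns a scalar when the grouped frame has a single row and a single column, so selecting the column explicitly with grouped['count'] may be the safer choice for arbitrary input.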

Another solution uses pivot; the resulting MultiIndex in the columns then needs to be flattened with a list comprehension:

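# pivot keeps 'count' as the top column level, giving tuples like ('count', 'general').
df2 = df.pivot(index='id', columns='module')
# Flatten each tuple to 'general_count', 'payroll_count', ...
df2.columns = ['_'.join((col[1], col[0])) for col in df2.columns]
print(df2)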
    general_count  payroll_count  tax_count
id                                         
2            53.0            5.0        NaN
5             NaN            NaN       35.0
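
Note that pivot raises a ValueError when the same id and module combination occurs more than once. In that case pivot_table, which aggregates duplicates, may be a better fit. The sketch below assumes the sample data from the question and uses mean as the aggregation, as in the groupby solution; the names sample and wide are illustrative. It also resets the index so that id becomes a regular column, as in the layout the question asks for:

```python
import pandas as pd

# Sample data from the question.
sample = pd.DataFrame([
    {'count': 5, 'module': 'payroll', 'id': 2},
    {'count': 53, 'module': 'general', 'id': 2},
    {'count': 35, 'module': 'tax', 'id': 5},
])

# Passing values='count' keeps the columns single-level, so no flattening is needed.
wide = sample.pivot_table(index='id', columns='module',
                          values='count', aggfunc='mean')

# Append the suffix, then turn the id index back into an ordinary column.
wide = wide.add_suffix('_count').reset_index()

# Drop the leftover 'module' label on the column axis.
wide.columns.name = None
print(wide)
```

Here the columns come out as id followed by the module columns in alphabetical order, which differs slightly from the column order sketched in the question.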

jezrael answered Oct 04 '22 19:10