
Reshape pandas dataframe from rows to columns

I'm trying to reshape my data. At first glance, it sounds like a transpose, but it's not. I've tried melt, stack/unstack, joins, etc.

Use Case

I want only one row per unique individual, with all of that person's job history spread across the columns. For clients, it can be easier to read information across a row than down a column.

Here's the data:

import pandas as pd
import numpy as np

data1 = {'Name': ["Joe", "Joe", "Joe","Jane","Jane"],
        'Job': ["Analyst","Manager","Director","Analyst","Manager"],
        'Job Eff Date': ["1/1/2015","1/1/2016","7/1/2016","1/1/2015","1/1/2016"]}
df2 = pd.DataFrame(data1, columns=['Name', 'Job', 'Job Eff Date'])

df2

Here's what I want it to look like: Desired Output Table


Christopher asked Jul 31 '16 07:07




1 Answer

.T within groupby

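def tgrp(df):
    # errors='ignore': newer pandas versions may not pass the grouping column to apply
    df = df.drop(columns='Name', errors='ignore')
    return df.reset_index(drop=True).T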

df2.groupby('Name').apply(tgrp).unstack()



Explanation

groupby returns an object that contains information on how the original series or dataframe has been grouped. Instead of chaining the groupby directly with a subsequent action, we could first assign df2.groupby('Name') to a variable (I often do), say gb.

gb = df2.groupby('Name')

On this object gb we could call .mean() to get an average of each group. Or .last() to get the last element (row) of each group. Or .transform(lambda x: (x - x.mean()) / x.std()) to get a zscore transformation within each group. When there is something you want to do within a group that doesn't have a predefined function, there is still .apply().
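As a minimal sketch of those built-in group operations, assume a small numeric frame. The `scores` data and the `Score` column are made up for illustration and are not part of the question's data, since taking a mean of the job strings would not be meaningful:

```python
import pandas as pd

# Hypothetical numeric data, used only to illustrate the groupby methods
scores = pd.DataFrame({
    "Name": ["Joe", "Joe", "Joe", "Jane", "Jane"],
    "Score": [1.0, 2.0, 3.0, 10.0, 20.0],
})

gb = scores.groupby("Name")

# Average of each group
group_means = gb["Score"].mean()

# Last element of each group
group_last = gb["Score"].last()

# Z-score within each group; the result keeps the original row index
zscores = gb["Score"].transform(lambda x: (x - x.mean()) / x.std())
```

Note that `.mean()` and `.last()` return one value per group, while `.transform()` returns one value per original row.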

.apply() for a groupby object is different than it is for a dataframe. For a dataframe, .apply() takes a callable as its argument and applies that callable to each column (or row) in the object. The object that is passed to that callable is a pd.Series. When you are using .apply in a dataframe context, it is helpful to keep this fact in mind. In the context of a groupby object, the object passed to the callable is a dataframe. In fact, that dataframe is one of the groups specified by the groupby.
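A quick way to see the difference is to pass a callable that only reports the type of its argument. This is a sketch: `type_name` is a hypothetical helper, and the non-key columns are selected explicitly so the example should behave the same across pandas versions:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Joe", "Joe", "Joe", "Jane", "Jane"],
    "Job": ["Analyst", "Manager", "Director", "Analyst", "Manager"],
    "Job Eff Date": ["1/1/2015", "1/1/2016", "7/1/2016", "1/1/2015", "1/1/2016"],
})

def type_name(obj):
    # Report the type of whatever pandas hands to the callable
    return type(obj).__name__

# DataFrame.apply: the callable receives one column (a Series) at a time
per_column = df2.apply(type_name)

# GroupBy.apply: the callable receives one group (a DataFrame) at a time
per_group = df2.groupby("Name")[["Job", "Job Eff Date"]].apply(type_name)

print(per_column)
print(per_group)
```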

When I write such functions to pass to groupby.apply, I typically define the parameter as df to reflect that it is a dataframe.

Ok, so we have:

df2.groupby('Name').apply(tgrp)

This generates a sub-dataframe for each 'Name' and passes that sub-dataframe to the function tgrp. The groupby object then recombines the results returned by tgrp into a single dataframe.
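The shape of that recombined result can be inspected directly. The following is a self-contained sketch of the same call. The `errors="ignore"` argument is my addition, on the assumption that some pandas versions no longer pass the grouping column to the function:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Joe", "Joe", "Joe", "Jane", "Jane"],
    "Job": ["Analyst", "Manager", "Director", "Analyst", "Manager"],
    "Job Eff Date": ["1/1/2015", "1/1/2016", "7/1/2016", "1/1/2015", "1/1/2016"],
})

def tgrp(df):
    # Drop the key column if present, renumber the rows from 0, then transpose
    df = df.drop(columns="Name", errors="ignore")
    return df.reset_index(drop=True).T

stacked = df2.groupby("Name").apply(tgrp)

# Outer index level: the group key; inner level: the transposed column names
print(stacked.index.tolist())
# Columns: the per-group positions; Jane has only two jobs, so her last column is NaN
print(stacked.columns.tolist())
```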

The result has one block of rows per name (Job and Job Eff Date), with the positions 0, 1, 2 as columns.

I took the OP's original idea of simply transposing to heart, but a few things had to happen first. Here is what a plain transpose of each person's rows gives:

df2[df2.Name == 'Jane'].T


df2[df2.Name == 'Joe'].T


Combining these manually (without groupby):

pd.concat([df2[df2.Name == 'Jane'].T, df2[df2.Name == 'Joe'].T])


Whoa! Now that's ugly. After the transpose, the original index values become column labels, and Joe's [0, 1, 2] obviously don't mesh with Jane's [3, 4]. So let's reset.

pd.concat([df2[df2.Name == 'Jane'].reset_index(drop=True).T,
           df2[df2.Name == 'Joe'].reset_index(drop=True).T])


That's much better. But now we are getting into the territory groupby was intended to handle. So let it handle it.
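To see what groupby takes over, the manual concatenation can be generalized by passing a dict to pd.concat, so that each name becomes an outer index level. This is only a sketch of the manual route, with the list of names hard-coded as an assumption:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Joe", "Joe", "Joe", "Jane", "Jane"],
    "Job": ["Analyst", "Manager", "Director", "Analyst", "Manager"],
    "Job Eff Date": ["1/1/2015", "1/1/2016", "7/1/2016", "1/1/2015", "1/1/2016"],
})

# One transposed block per person, built by hand
parts = {
    name: df2[df2.Name == name].drop(columns="Name").reset_index(drop=True).T
    for name in ["Jane", "Joe"]
}

# The dict keys become the outer level of the resulting index
manual = pd.concat(parts)
print(manual)
```

With groupby, the filtering, the per-group function call, and the keyed concatenation happen in one step.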

Back to

df2.groupby('Name').apply(tgrp)

The only thing missing here is that we want to unstack the results to get the desired output.
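Putting it together, here is a self-contained sketch of the full pipeline. The last step, flattening the two-level columns into labels such as `Job 1`, is not part of the answer above. The label format is my own assumption about what a client-facing table might use:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Name": ["Joe", "Joe", "Joe", "Jane", "Jane"],
    "Job": ["Analyst", "Manager", "Director", "Analyst", "Manager"],
    "Job Eff Date": ["1/1/2015", "1/1/2016", "7/1/2016", "1/1/2015", "1/1/2016"],
})

def tgrp(df):
    # Drop the key column if present, renumber the rows from 0, then transpose
    df = df.drop(columns="Name", errors="ignore")
    return df.reset_index(drop=True).T

# unstack() moves the inner index level (Job / Job Eff Date) into the columns,
# leaving one row per name and columns keyed by (position, field)
result = df2.groupby("Name").apply(tgrp).unstack()

# Optional: flatten (position, field) into single labels, counting from 1
flat = result.copy()
flat.columns = [f"{field} {pos + 1}" for pos, field in flat.columns]
print(flat)
```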


piRSquared answered Oct 24 '22 06:10