
Unstacking data with pandas

Tags: python, pandas

I have some data that I'm taking from 'long' to 'wide'. I have no problem using unstack to make the data wide, but then I end up with what looks like an index which I can't get rid of. Here's a dummy example:

## set up some dummy data
import pandas as pd
d = {'state'  : ['a','b','a','b','a','b','a','b'],
     'year' : [1,1,1,1,2,2,2,2],
     'description'  : ['thing1','thing1','thing1','thing2','thing2','thing2','thing1','thing2'],
     'value' : [1., 2., 3., 4.,1., 2., 3., 4.]}
df = pd.DataFrame(d)
## now that we have dummy data do the long to wide conversion

dfGrouped = df.groupby(['state','year', 'description']).value.sum() 

dfUnstacked = dfGrouped.unstack('description')
print(dfUnstacked)


description  thing1  thing2
state year                 
a     1           4     NaN
      2           3       1
b     1           2       4
      2         NaN       6

So that looks like what I would expect. Now I'd like an unindexed data frame with columns 'state', 'year', 'thing1', 'thing2'. So it seems I should do this:

dfUnstackedNoIndex = dfUnstacked.reset_index()
print(dfUnstackedNoIndex)

description state  year  thing1  thing2
0               a     1       4     NaN
1               a     2       3       1
2               b     1       2       4
3               b     2     NaN       6

Ok, that's close. But I don't want description carried forward. So let's select out only the columns I want:

print(dfUnstackedNoIndex[['state','year','thing1','thing2']])

description state  year  thing1  thing2
0               a     1       4     NaN
1               a     2       3       1
2               b     1       2       4
3               b     2     NaN       6

So what's up with 'description'? Why does it hang around even though I reset the index and selected only a few columns? Clearly I'm not grokking something.

FWIW, my Pandas version is 0.12

JD Long asked Dec 30 '13 21:12


1 Answer

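description is not a column; it is the name of the columns index (dfUnstackedNoIndex.columns.name). It was the name of the index level you unstacked, and neither reset_index nor selecting columns clears it. You can get rid of it like this: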

In [74]: dfUnstackedNoIndex.columns.name = None

In [75]: dfUnstackedNoIndex
Out[75]: 
  state  year  thing1  thing2
0     a     1       4     NaN
1     a     2       3       1
2     b     1       2       4
3     b     2     NaN       6
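
For what it's worth, the same cleanup can probably be folded into one method chain. This is only a sketch, assuming a reasonably recent pandas (not the 0.12 from the question) where rename_axis accepts None together with axis=1; the variable name wide is made up for illustration.

```python
import pandas as pd

# Same dummy data as in the question.
df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})

# 'wide' is a hypothetical name, not from the original post.
wide = (
    df.groupby(['state', 'year', 'description'])['value'].sum()
      .unstack('description')      # long -> wide
      .reset_index()               # state/year become ordinary columns
      .rename_axis(None, axis=1)   # clear the leftover 'description' columns name
)

print(wide.columns.name)  # None
print(wide)
```

Setting columns.name = None as above does the same thing in place; the chained form just avoids a separate statement.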

The purpose of column names perhaps becomes clearer when you look at what happens when you unstack twice:

In [107]: dfUnstacked2 = dfUnstacked.unstack('state')
In [108]: dfUnstacked2
Out[108]: 
description  thing1      thing2   
state             a   b       a  b
year                              
1                 4   2     NaN  4
2                 3 NaN       1  6

Now dfUnstacked2.columns is a MultiIndex. Each level has a name which corresponds to the name of the index level that has been converted into a column level.

In [111]: dfUnstacked2.columns
Out[111]: 
MultiIndex(levels=[[u'thing1', u'thing2'], [u'a', u'b']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'description', u'state'])
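
One place the level names earn their keep is selecting by level. Here is a small sketch, assuming a recent pandas (where the MultiIndex repr also looks different from the 0.12 output above); a_only is a name invented for this example.

```python
import pandas as pd

# Same dummy data as in the question.
df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})

dfUnstacked = (
    df.groupby(['state', 'year', 'description'])['value'].sum()
      .unstack('description')
)
dfUnstacked2 = dfUnstacked.unstack('state')

# The column levels keep the names of the index levels they came from.
print(list(dfUnstacked2.columns.names))  # ['description', 'state']

# Because the level is named, it can be addressed by name instead of position.
# 'a_only' is a hypothetical name: the columns for state 'a' only.
a_only = dfUnstacked2.xs('a', axis=1, level='state')
print(a_only)
```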

Column names and index names show up in the same place in the string representation of DataFrames, so it can be hard to know which is which. You can figure it out by inspecting df.index.names and df.columns.names.
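
As a quick illustration of that check, here is a sketch that prints both sets of names before and after reset_index. The helper axis_names is hypothetical (not part of pandas), and a recent pandas version is assumed.

```python
import pandas as pd

# Same dummy data as in the question.
df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})

dfUnstacked = (
    df.groupby(['state', 'year', 'description'])['value'].sum()
      .unstack('description')
)
dfUnstackedNoIndex = dfUnstacked.reset_index()


def axis_names(frame):
    """Hypothetical helper: (index level names, column level names) as lists."""
    return list(frame.index.names), list(frame.columns.names)


# Before reset_index: 'state' and 'year' name the index, 'description' the columns.
print(axis_names(dfUnstacked))         # (['state', 'year'], ['description'])

# After reset_index: the index names are gone, the columns name is still there.
print(axis_names(dfUnstackedNoIndex))  # ([None], ['description'])
```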

unutbu answered Oct 21 '22 15:10