I have some data that I'm taking from 'long' to 'wide'. I have no problem using unstack to make the data wide, but then I end up with what looks like an index which I can't get rid of. Here's a dummy example:
## set up some dummy data
import pandas as pd
d = {'state' : ['a','b','a','b','a','b','a','b'],
'year' : [1,1,1,1,2,2,2,2],
'description' : ['thing1','thing1','thing1','thing2','thing2','thing2','thing1','thing2'],
'value' : [1., 2., 3., 4.,1., 2., 3., 4.]}
df = pd.DataFrame(d)
## now that we have dummy data do the long to wide conversion
dfGrouped = df.groupby(['state','year', 'description']).value.sum()
dfUnstacked = dfGrouped.unstack('description')
print(dfUnstacked)
description  thing1  thing2
state year
a     1           4     NaN
      2           3       1
b     1           2       4
      2         NaN       6
So that looks like what I would expect. Now I'd like an unindexed data frame with columns 'state', 'year', 'thing1', 'thing2'. So it seems I should do thus:
dfUnstackedNoIndex = dfUnstacked.reset_index()
print(dfUnstackedNoIndex)
description state  year  thing1  thing2
0               a     1       4     NaN
1               a     2       3       1
2               b     1       2       4
3               b     2     NaN       6
Ok, that's close. But I don't want description carried forward. So let's select out only the columns I want:
print(dfUnstackedNoIndex[['state','year','thing1','thing2']])
description state  year  thing1  thing2
0               a     1       4     NaN
1               a     2       3       1
2               b     1       2       4
3               b     2     NaN       6
So what's up with 'description'? Why does it hang around even though I reset the index and selected only a few columns? Clearly I'm not grokking something.
FWIW, my Pandas version is 0.12
'description' is the name of the columns index, not a column itself. You can get rid of it like this:
In [74]: dfUnstackedNoIndex.columns.name = None
In [75]: dfUnstackedNoIndex
Out[75]:
  state  year  thing1  thing2
0     a     1       4     NaN
1     a     2       3       1
2     b     1       2       4
3     b     2     NaN       6
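For reference, here is a minimal end-to-end sketch of the same fix, rebuilt from the question's dummy data. The variable names (wide, flat, flat_alt, name_before) are mine, and the rename_axis line is an alternative that assumes a pandas release much newer than the 0.12 mentioned in the question.

```python
import pandas as pd

# Same dummy data as in the question.
df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})

# Long to wide, then move the row index levels back into columns.
wide = df.groupby(['state', 'year', 'description'])['value'].sum().unstack('description')
flat = wide.reset_index()

# The leftover label is the name of the columns index, not a column.
name_before = flat.columns.name
flat.columns.name = None

# Assumption: newer pandas only. Same effect without mutating in place.
flat_alt = wide.reset_index().rename_axis(None, axis=1)

print(name_before)
print(list(flat.columns))
```

Selecting columns with [['state', 'year', 'thing1', 'thing2']] never removes the label because it was never one of the columns; only clearing the name does.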
The purpose of column names perhaps becomes clearer when you look at what happens when you unstack twice:
In [107]: dfUnstacked2 = dfUnstacked.unstack('state')
In [108]: dfUnstacked2
Out[108]:
description  thing1       thing2
state             a    b       a    b
year
1                 4    2     NaN    4
2                 3  NaN       1    6
Now dfUnstacked2.columns is a MultiIndex. Each level has a name, which corresponds to the name of the index level that was converted into a column level.
In [111]: dfUnstacked2.columns
Out[111]:
MultiIndex(levels=[[u'thing1', u'thing2'], [u'a', u'b']],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=[u'description', u'state'])
Column names and index names show up in the same place in the string representation of DataFrames, so it can be hard to know which is which. You can figure it out by inspecting df.index.names and df.columns.names.
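As a rough illustration of that check, the sketch below rebuilds the frames from the question and reads the names off each axis after one and two unstacks. The variable names (wide, wide2, and the *_names lists) are hypothetical, chosen only for this example.

```python
import pandas as pd

# Same dummy data as in the question.
df = pd.DataFrame({
    'state': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b'],
    'year': [1, 1, 1, 1, 2, 2, 2, 2],
    'description': ['thing1', 'thing1', 'thing1', 'thing2',
                    'thing2', 'thing2', 'thing1', 'thing2'],
    'value': [1., 2., 3., 4., 1., 2., 3., 4.],
})

grouped = df.groupby(['state', 'year', 'description'])['value'].sum()
wide = grouped.unstack('description')   # one level moved to the columns
wide2 = wide.unstack('state')           # a second level moved to the columns

# Names on the row axis versus names on the column axis.
index_names = list(wide.index.names)
column_names = list(wide.columns.names)
index_names2 = list(wide2.index.names)
column_names2 = list(wide2.columns.names)

print(index_names, column_names)
print(index_names2, column_names2)
```

Each unstack moves one name from the index side to the columns side, which is why 'description' ends up labelling the columns rather than a row level.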