I don't understand the output of pandas' groupby. I started with a DataFrame (df0
) with 5 fields/columns (zip, city, location, population, state).
>>> df0.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29467 entries, 0 to 29466
Data columns (total 5 columns):
zip 29467 non-null object
city 29467 non-null object
loc 29467 non-null object
pop 29467 non-null int64
state 29467 non-null object
dtypes: int64(1), object(4)
memory usage: 1.1+ MB
I wanted to get the total population of each city, but since several cities have multiple zip codes, I thought I would use groupby.sum as follows:
df6 = df0.groupby(['city','state'])['pop'].sum()
However, this returned a Series instead of a DataFrame:
>>> df6.info()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2672, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'info'
>>> type(df6)
<class 'pandas.core.series.Series'>
I would like to be able look up the population of any city with a method similar to
df0[df0['city'].isin(['ALBANY'])]
but since I have a Series instead of a DataFrame, I can't. I haven't been able to force a conversion into a DataFrame either.
What I'm now wondering is:
to_frame() function is used to convert the given series object to a dataframe. Parameter : name : The passed name should substitute for the series name (if it has one).
Transformations. Transformation on a group or a column returns an object that is indexed the same size of that is being grouped.
You can create a DataFrame from multiple Series objects by adding each series as a columns. By using concat() method you can merge multiple series together into DataFrame. This takes several params, for our scenario we use list that takes series to combine and axis=1 to specify merge series as columns instead of rows.
Need parameter as_index=False
in groupby
or reset_index
for convert MultiIndex
to columns:
df6 = df0.groupby(['city','state'], as_index=False)['pop'].sum()
Or:
df6 = df0.groupby(['city','state'])['pop'].sum().reset_index()
Sample:
df0 = pd.DataFrame({'city':['a','a','b'],
'state':['t','t','n'],
'pop':[7,8,9]})
print (df0)
city pop state
0 a 7 t
1 a 8 t
2 b 9 n
df6 = df0.groupby(['city','state'], as_index=False)['pop'].sum()
print (df6)
city state pop
0 a t 15
1 b n 9
df6 = df0.groupby(['city','state'])['pop'].sum().reset_index()
print (df6)
city state pop
0 a t 15
1 b n 9
Last select by loc
, for scalar add item()
:
print (df6.loc[df6.state == 't', 'pop'])
0 15
Name: pop, dtype: int64
print (df6.loc[df6.state == 't', 'pop'].item())
15
But if need only lookup table is possible use Series
with MultiIndex
:
s = df0.groupby(['city','state'])['pop'].sum()
print (s)
city state
a t 15
b n 9
Name: pop, dtype: int64
#select all cities by : and state by string like 't'
#output is Series of len 1
print (s.loc[:, 't'])
city
a 15
Name: pop, dtype: int64
#if need output as scalar add item()
print (s.loc[:, 't'].item())
15
It's hard to say definitively without sample data, but with the code you show, returning a Series, you should be able to access the population for a city by using something like df6.loc['Albany', 'NY']
(that is, index your grouped Series by the city and state).
The reason you get a Series is because you selected a single column ('pop')
on which to apply your group computation. If you apply your group computation to a list of columns, you'll get a DataFrame. You could do this by doing df6 = df0.groupby(['city','state'])[['pop']].sum()
. (Note the extra brackets around 'pop'
, to select a list of one column instead of a single column.) But I'm not sure there's a reason to do this if you can use the above method to access the city data anyway.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With