How can I get pandas' groupby command to return a DataFrame instead of a Series?

Tags:

pandas

I don't understand the output of pandas' groupby. I started with a DataFrame (df0) with 5 fields/columns (zip, city, location, population, state).

 >>> df0.info()
 <class 'pandas.core.frame.DataFrame'>
 RangeIndex: 29467 entries, 0 to 29466
 Data columns (total 5 columns):
 zip      29467 non-null object
 city     29467 non-null object
 loc      29467 non-null object
 pop      29467 non-null int64
 state    29467 non-null object
 dtypes: int64(1), object(4)
 memory usage: 1.1+ MB

I wanted to get the total population of each city, but since several cities have multiple zip codes, I thought I would use groupby.sum as follows:

  df6 = df0.groupby(['city','state'])['pop'].sum()

However, this returned a Series instead of a DataFrame:

 >>> df6.info()
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
   File "/usr/local/lib/python2.7/dist-packages/pandas/core/generic.py", line 2672, in __getattr__
     return object.__getattribute__(self, name)
  AttributeError: 'Series' object has no attribute 'info'
 >>> type(df6)
 <class 'pandas.core.series.Series'>

I would like to be able to look up the population of any city with a method similar to

 df0[df0['city'].isin(['ALBANY'])]

but since I have a Series instead of a DataFrame, I can't. I haven't been able to force a conversion into a DataFrame either.

What I'm now wondering is:

Why didn't I get a DataFrame back instead of a Series?
How can I get a table that will let me look up the population of a city? Can I use the Series I got from groupby, or should I have taken a different approach?

asked Feb 19 '17 05:02

2 Answers

You need the parameter as_index=False in groupby, or a call to reset_index afterwards, to convert the MultiIndex into columns:

df6 = df0.groupby(['city','state'], as_index=False)['pop'].sum()

Or:

df6 = df0.groupby(['city','state'])['pop'].sum().reset_index()

Sample:

import pandas as pd

df0 = pd.DataFrame({'city':['a','a','b'],
                   'state':['t','t','n'],
                   'pop':[7,8,9]})

print (df0)
  city  pop state
0    a    7     t
1    a    8     t
2    b    9     n

df6 = df0.groupby(['city','state'], as_index=False)['pop'].sum()
print (df6)
  city state  pop
0    a     t   15
1    b     n    9

df6 = df0.groupby(['city','state'])['pop'].sum().reset_index()
print (df6)
  city state  pop
0    a     t   15
1    b     n    9

Finally, select with loc; to get a scalar, add item():

print (df6.loc[df6.state == 't', 'pop'])
0    15
Name: pop, dtype: int64

print (df6.loc[df6.state == 't', 'pop'].item())
15
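
The isin-style filter from the question also works on the flattened result. A minimal sketch, reusing the three-row sample frame from this answer (illustrative data, not the real zip-code table; the variable name rows is made up for the example):

```python
import pandas as pd

# Illustrative sample data, same as the small frame used in this answer
df0 = pd.DataFrame({'city': ['a', 'a', 'b'],
                    'state': ['t', 't', 'n'],
                    'pop': [7, 8, 9]})

# as_index=False keeps the group keys as ordinary columns
df6 = df0.groupby(['city', 'state'], as_index=False)['pop'].sum()

# Boolean filter with the same shape as the lookup in the question
rows = df6[df6['city'].isin(['a'])]

print(type(df6).__name__)    # DataFrame
print(rows['pop'].tolist())  # [15]
```

Because city and state are regular columns here, any column-based filter works, including combined conditions on both.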

But if you only need a lookup table, you can use the Series with its MultiIndex directly:

s = df0.groupby(['city','state'])['pop'].sum()
print (s)
city  state
a     t        15
b     n         9
Name: pop, dtype: int64

# select all cities with : and one state by a label such as 't'
# the output is a Series of length 1
print (s.loc[:, 't'])
city
a    15
Name: pop, dtype: int64

# if you need the output as a scalar, add item()
print (s.loc[:, 't'].item())
15

answered Nov 14 '22 21:11

It's hard to say definitively without sample data, but with the code you show, returning a Series, you should be able to access the population for a city by using something like df6.loc['Albany', 'NY'] (that is, index your grouped Series by the city and state).
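
A rough sketch of that kind of lookup, assuming made-up population numbers and upper-case city names like the 'ALBANY' in the question (labels have to match the stored values exactly, including case):

```python
import pandas as pd

# Hypothetical data: two zip codes for ALBANY, NY and one for ALBANY, GA
df0 = pd.DataFrame({'city': ['ALBANY', 'ALBANY', 'ALBANY'],
                    'state': ['NY', 'NY', 'GA'],
                    'pop': [100, 50, 30]})

# Grouping by two keys gives a Series indexed by (city, state)
s = df0.groupby(['city', 'state'])['pop'].sum()

# Full key: returns a single value
albany_ny = s.loc[('ALBANY', 'NY')]

# Partial key on the first level: returns a Series indexed by state
all_albany = s.loc['ALBANY']

print(albany_ny)             # 150
print(all_albany.to_dict())  # {'GA': 30, 'NY': 150}
```

The partial lookup is handy when the same city name occurs in several states.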

You get a Series because you selected a single column ('pop') on which to apply your group computation. If you apply the computation to a list of columns, you'll get a DataFrame. You could do this with df6 = df0.groupby(['city','state'])[['pop']].sum(). (Note the extra brackets around 'pop', which select a list of one column instead of a single column.) But I'm not sure there's a reason to do this if you can use the above method to access the city data anyway.
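
To make the difference concrete, here is a small sketch on illustrative data (the variable names are invented for the example). It also shows Series.to_frame(), which is one way to convert an existing grouped Series afterwards:

```python
import pandas as pd

# Illustrative sample data
df0 = pd.DataFrame({'city': ['a', 'a', 'b'],
                    'state': ['t', 't', 'n'],
                    'pop': [7, 8, 9]})

grouped = df0.groupby(['city', 'state'])

as_series = grouped['pop'].sum()    # single column label -> Series
as_frame = grouped[['pop']].sum()   # list of one column  -> DataFrame
converted = as_series.to_frame()    # Series -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(type(converted).__name__)  # DataFrame
```

Both DataFrames keep (city, state) as a MultiIndex, so a lookup looks like as_frame.loc[('a', 't'), 'pop'].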

answered Nov 14 '22 23:11

BrenBarn



Asked by user1245262; answered by jezrael and BrenBarn.