How does the pandas groupby method actually work?

Tags: python, pandas, dataframe

I was trying to understand the pandas.DataFrame.groupby() function and came across this example in the documentation:

In [1]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8),
   ...:                    'D' : np.random.randn(8)})
   ...: 

In [2]: df
Out[2]: 
     A      B         C         D
0  foo    one  0.469112 -0.861849
1  bar    one -0.282863 -2.104569
2  foo    two -1.509059 -0.494929
3  bar  three -1.135632  1.071804
4  foo    two  1.212112  0.721555
5  bar    two -0.173215 -0.706771
6  foo    one  0.119209 -1.039575
7  foo  three -1.044236  0.271860

Now, to explore further, I did this:

print(df.groupby('B').head())

It outputs the same DataFrame, but when I do this:

print(df.groupby('B'))

It gives me this:

<pandas.core.groupby.DataFrameGroupBy object at 0x7f65a585b390>

What does this mean? On a normal DataFrame, printing .head() simply outputs the first 5 rows. What's happening here?

Also, why does printing .head() give the same output as the DataFrame? Shouldn't it be grouped by the elements of column 'B'?


asked Jul 22 '17 13:07

aroma

1 Answer

When you use just

df.groupby('A')

you get a GroupBy object. You haven't applied any function to it at that point. Conceptually, while this definition might not be perfect, you can think of a GroupBy object as:

An iterator of (group, DataFrame) pairs, for DataFrames, or
An iterator of (group, Series) pairs, for Series.

To illustrate:

import pandas as pd

df = pd.DataFrame({'A' : [1, 1, 2, 2], 'B' : [1, 2, 3, 4]})
grouped = df.groupby('A')

# each `i` is a tuple of (group, DataFrame)
# so your output here will be a little messy
for i in grouped:
    print(i)
(1,    A  B
0  1  1
1  1  2)
(2,    A  B
2  2  3
3  2  4)

# this version unpacks each tuple into two loop
# variables.  each `group` is a group key, each
# `frame` is its corresponding DataFrame
# (named `frame` so the original `df` is not overwritten)
for group, frame in grouped:
    print('group of A:', group, '\n')
    print(frame, '\n')
group of A: 1 

   A  B
0  1  1
1  1  2 

group of A: 2 

   A  B
2  2  3
3  2  4 

# and if you just wanted to visualize the groups,
# your second loop variable is a "throwaway"
for group, _ in grouped:
    print('group of A:', group, '\n')
group of A: 1 

group of A: 2
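
If you want to inspect the groups without writing a loop, a GroupBy object also exposes a few attributes and methods for that. The following is a minimal sketch using the same small frame as above; the variable names (`by_a`, `b_by_a`, and so on) are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 2], 'B': [1, 2, 3, 4]})
by_a = df.groupby('A')

# These describe the grouping itself rather than aggregating anything.
n_groups = by_a.ngroups              # number of distinct keys in 'A'
group_keys = list(by_a.groups)       # the keys themselves
first_group = by_a.get_group(1)      # the sub-DataFrame where A == 1

# Grouping a single column yields (key, Series) pairs instead.
b_by_a = df['B'].groupby(df['A'])
series_pairs = {key: values.tolist() for key, values in b_by_a}

print(n_groups)
print(group_keys)
print(first_group)
print(series_pairs)
```

This covers both cases from the list above: iterating a DataFrame grouping gives DataFrames, while iterating a Series grouping gives Series.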

Now, as for .head, just have a look at the docs for that method:

Essentially equivalent to .apply(lambda x: x.head(n))

So here you're actually applying a function to each group of the GroupBy object. Keep in mind that .head() defaults to n=5 and is applied to each group (each DataFrame), so because no group in your example has more than 5 rows, you get your original DataFrame back.

Consider this with the example above. If you use .head(1), you get only the first row of each group:

print(df.groupby('A').head(1))
   A  B
0  1  1
2  2  3
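
To check this against a frame shaped like the one in the question, here is a minimal sketch. The values in `C` are made up for illustration (the question uses random numbers), and the names `q`, `sizes`, `same_as_original` and `first_rows` are hypothetical.

```python
import pandas as pd

q = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],
    'C': [0.5, -0.3, -1.5, -1.1, 1.2, -0.2, 0.1, -1.0],  # illustrative values
})

# Rows per group: none of the groups has more than 5 rows ...
sizes = q.groupby('B').size()

# ... so head() with its default n=5 keeps every row of every group.
same_as_original = q.groupby('B').head().equals(q)

# With n=1, only the first row of each group survives.
first_rows = q.groupby('B').head(1)

print(sizes)
print(same_as_original)
print(first_rows)
```

Aggregations such as .size() or .sum() are what collapse each group into a single row; .head() only filters rows, which is why the result still looks like the original frame.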


answered Oct 03 '22 10:10

Brad Solomon
