Python Pandas: Is Order Preserved When Using groupby() and agg()?

Tags:

python

pandas

aggregate

I've frequently used pandas' agg() function to run summary statistics on every column of a DataFrame. For example, here's how you would produce the mean and standard deviation:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

>>> df
[output]
        A   B    C
0  group1  10  100
1  group1  12  102
2  group2  10  100
3  group2  25  250
4  group3  10  100
5  group3  12  102

In both of those cases, the order in which individual rows are sent to the agg function does not matter. But consider the following example, which does depend on row order:

df.groupby('A').agg([np.mean, lambda x: x.iloc[1] ])

[output]

           B             C
        mean <lambda> mean <lambda>
A
group1  11.0       12  101      102
group2  17.5       25  175      250
group3  11.0       12  101      102

In this case the lambda functions as intended, outputting the second row in each group. However, I have not been able to find anything in the pandas documentation that implies that this is guaranteed to be true in all cases. I want to use agg() along with a weighted average function, so I want to be sure that the rows that come into the function will be in the same order as they appear in the original data frame.

Does anyone know, ideally via somewhere in the docs or pandas source code, if this is guaranteed to be the case?


asked Oct 19 '14 22:10

BringMyCakeBack

4 Answers

See this enhancement issue

The short answer is yes, the groupby will preserve the orderings as passed in. You can prove this by using your example like this:

In [20]: df.sort_index(ascending=False).groupby('A').agg([np.mean, lambda x: x.iloc[1] ])
Out[20]: 
           B             C         
        mean <lambda> mean <lambda>
A                                  
group1  11.0       10  101      100
group2  17.5       10  175      100
group3  11.0       10  101      100
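A broader check than a single reversed example is to shuffle the rows and compare what each group receives against the row order of the shuffled frame. The sketch below is only an illustration: the name `shuffled` and the `random_state` value are arbitrary choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'C': [100, 102, 100, 250, 100, 102]})

# Shuffle the rows so the index is no longer monotonic.
shuffled = df.sample(frac=1, random_state=0)

order_preserved = True
for key, group in shuffled.groupby('A'):
    # Index labels of this key's rows, in the order they appear in `shuffled`.
    expected = shuffled.index[shuffled['A'] == key]
    if list(group.index) != list(expected):
        order_preserved = False

print(order_preserved)
```

If the within-group order is preserved, every group's index matches the order of its rows in `shuffled`, whatever that order happens to be.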

This is NOT true for resample, however, as it requires a monotonic index (it WILL work with a non-monotonic index, but will sort it first).

There is a sort= flag to groupby, but this relates to the sorting of the groups themselves and not the observations within a group.

FYI: df.groupby('A').nth(1) is a safe way to get the 2nd value of a group (your method above will fail if a group has fewer than 2 elements).
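To illustrate the short-group case, here is a minimal sketch using a hypothetical frame `small` in which one group has a single row. The helper `second_or_nan` is an assumption made up for this example. It shows one way to keep a positional lookup inside agg() without failing on short groups, alongside nth(1), which leaves such groups out.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which group 'g2' has only one row.
small = pd.DataFrame({'A': ['g1', 'g1', 'g2'],
                      'B': [10, 12, 7]})

def second_or_nan(values):
    # Return the second row of the group, or NaN when the group is too short.
    return values.iloc[1] if len(values) > 1 else np.nan

second = small.groupby('A')['B'].agg(second_or_nan)
print(second)

# nth(1) skips groups that have no second row.
second_nth = small.groupby('A')['B'].nth(1)
print(second_nth)
```

Note that how the nth() result is indexed has differed between pandas versions, so it is safer to rely on its values than on its index labels.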

answered Oct 19 '22 03:10

Jeff

The pandas 0.19.1 documentation says "groupby preserves the order of rows within each group", so this is guaranteed behavior.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html
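For the weighted average mentioned in the question, one option that does not rely on row order at all is to keep values and weights in the same sub-frame. The sketch below is an assumption-laden illustration: the weights column `W` and the helper `weighted_mean` are hypothetical. It uses apply() on a column selection rather than agg(), because agg() generally hands the function one column at a time.

```python
import numpy as np
import pandas as pd

# 'W' is a hypothetical weights column added for illustration.
df = pd.DataFrame({'A': ['group1', 'group1', 'group2', 'group2', 'group3', 'group3'],
                   'B': [10, 12, 10, 25, 10, 12],
                   'W': [1, 3, 1, 1, 2, 2]})

def weighted_mean(group):
    # Values and weights come from the same sub-frame, so they stay paired row by row.
    return np.average(group['B'], weights=group['W'])

result = df.groupby('A')[['B', 'W']].apply(weighted_mean)
print(result)
```

For group1 this computes (10*1 + 12*3) / (1 + 3) = 11.5.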

answered Oct 19 '22 02:10

Uwe Mayer

In order to preserve the order in which the groups first appear, you'll need to pass .groupby(..., sort=False). In your case the grouping column is already sorted, so it does not make a difference, but in general one must use the sort=False flag:

 df.groupby('A', sort=False).agg([np.mean, lambda x: x.iloc[1] ])
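To see what the flag changes, here is a small sketch with a hypothetical frame `unsorted` whose keys are deliberately out of order. Only the order of the group keys in the result differs. The rows inside each group arrive in the same order either way.

```python
import pandas as pd

# Hypothetical frame whose group keys are not in sorted order.
unsorted = pd.DataFrame({'A': ['b', 'a', 'b', 'a'],
                         'B': [1, 2, 3, 4]})

# Collect each group's rows into a list so the within-group order is visible.
sorted_keys = unsorted.groupby('A')['B'].apply(list)
first_seen = unsorted.groupby('A', sort=False)['B'].apply(list)

print(list(sorted_keys.index))   # group keys in sorted order
print(list(first_seen.index))    # group keys in order of first appearance
print(sorted_keys['a'], first_seen['a'])  # same rows in the same order
```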

answered Oct 19 '22 02:10

Dima Lituiev

Reference: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html

The API accepts sort as an argument.

The description of the sort argument reads:

sort : bool, default True Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

Thus, it is clear that groupby does preserve the order of rows within each group.

answered Oct 19 '22 02:10

Jigidi Sarnath
