Selecting the first row of a sorted group from pandas data frame

Tags: python, pandas, dataframe, group-by, numpy

Suppose I have a pandas DataFrame like the one below:

campaignname    category_type    amount
A               cat_A_0            2.0
A               cat_A_1            1.0
A               cat_A_2            3.0
A               cat_A_2            3.0
A               cat_A_2            4.0
B               cat_B_0            3.0
C               cat_C_0            1.0
C               cat_C_1            2.0

I am using the following code to group this DataFrame (assigned to the variable df) by different columns:

for name, gp in df.groupby('campaignname'):
    sorted_gp = gp.groupby(['campaignname', 'category_type']).sum().sort_values(['amount'], ascending=False)
    # I'd like to know how to select this in a cleaner/more concise way
    first_row = [sorted_gp.iloc[0].name[0], sorted_gp.iloc[0].name[1], sorted_gp.iloc[0].values.tolist()[0]]

The purpose of the above code is to first group the raw data by the campaignname column, then, within each resulting group, group again by both campaignname and category_type, and finally sort by the amount column to pick the first row that comes up (the one with the highest amount in each group). Specifically, for the above example, I'd like to get results like this:

first_row = ['A', 'cat_A_2', 4.0] # for the first group
first_row = ['B', 'cat_B_0', 3.0] # for the second group
first_row = ['C', 'cat_C_1', 2.0] # for the third group

etc.

As you can see, I'm using a rather (in my opinion) 'ugly' way to retrieve the first row of each sorted group, but since I'm new to pandas, I don't know a better/cleaner way to accomplish this. If anyone could let me know a way to select the first row in a sorted group from a pandas dataframe, I'd greatly appreciate it. Thank you in advance for your answers/suggestions!

asked Feb 11 '17 20:02

user1330974

2 Answers

If I understand correctly, you can do it this way:

In [83]: df.groupby('campaignname', as_index=False) \
           .apply(lambda x: x.nlargest(1, columns=['amount'])) \
           .reset_index(level=1, drop=True)
Out[83]:
  campaignname category_type  amount
0            A       cat_A_2     4.0
1            B       cat_B_0     3.0
2            C       cat_C_1     2.0

or:

In [76]: df.sort_values('amount', ascending=False).groupby('campaignname').head(1)
Out[76]:
  campaignname category_type  amount
4            A       cat_A_2     4.0
5            B       cat_B_0     3.0
7            C       cat_C_1     2.0
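
For anyone who wants to try the second approach end to end, here is a minimal, self-contained sketch. It rebuilds the example frame from the question; the names `top` and `first_rows` are only illustrative, and the extra sort by campaignname is an assumption added so that the rows come out in group order rather than in descending order of amount.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "campaignname": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "category_type": ["cat_A_0", "cat_A_1", "cat_A_2", "cat_A_2",
                      "cat_A_2", "cat_B_0", "cat_C_0", "cat_C_1"],
    "amount": [2.0, 1.0, 3.0, 3.0, 4.0, 3.0, 1.0, 2.0],
})

top = (
    df.sort_values("amount", ascending=False)  # largest amounts first
      .groupby("campaignname")
      .head(1)                                 # first (largest) row of each group
      .sort_values("campaignname")             # assumption: report groups in name order
      .reset_index(drop=True)
)

# One plain Python list per group, in the shape the question asks for.
first_rows = top.values.tolist()
print(first_rows)
# [['A', 'cat_A_2', 4.0], ['B', 'cat_B_0', 3.0], ['C', 'cat_C_1', 2.0]]
```

Note that this picks the single row with the largest amount in each campaign; it does not sum amounts per category_type the way the loop in the question does.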

answered Sep 18 '22 12:09

MaxU - stop WAR against UA

My preferred way to do this is with idxmax, which returns the index label of the maximum value in each group. I then use those labels to select the matching rows from df:

df.loc[df.groupby('campaignname').amount.idxmax()]

  campaignname category_type  amount
4            A       cat_A_2     4.0
5            B       cat_B_0     3.0
7            C       cat_C_1     2.0
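
As a rough sketch of how this fits together, the snippet below rebuilds the example frame, shows the intermediate labels that idxmax produces, and converts the selected rows into lists like the ones in the question. The names `idx`, `top`, `first_rows`, and the small `ties` frame are hypothetical and only there for illustration. The last part shows what happens on ties: idxmax keeps the first row that reaches the maximum.

```python
import pandas as pd

# Rebuild the example frame from the question.
df = pd.DataFrame({
    "campaignname": ["A", "A", "A", "A", "A", "B", "C", "C"],
    "category_type": ["cat_A_0", "cat_A_1", "cat_A_2", "cat_A_2",
                      "cat_A_2", "cat_B_0", "cat_C_0", "cat_C_1"],
    "amount": [2.0, 1.0, 3.0, 3.0, 4.0, 3.0, 1.0, 2.0],
})

# Index label of the largest amount within each campaign.
idx = df.groupby("campaignname")["amount"].idxmax()
print(idx.tolist())  # [4, 5, 7]

# Select those rows and turn each one into a plain list.
top = df.loc[idx]
first_rows = top.values.tolist()
print(first_rows)
# [['A', 'cat_A_2', 4.0], ['B', 'cat_B_0', 3.0], ['C', 'cat_C_1', 2.0]]

# Hypothetical tie: both rows of group "X" share the maximum amount.
ties = pd.DataFrame({
    "campaignname": ["X", "X"],
    "category_type": ["cat_X_0", "cat_X_1"],
    "amount": [5.0, 5.0],
})
# idxmax returns the label of the first occurrence of the maximum.
tie_idx = ties.groupby("campaignname")["amount"].idxmax()
print(tie_idx.tolist())  # [0]
```

Because `df.loc` is given index labels, this relies on the index of df being unique, as it is with the default integer index used here.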

answered Sep 20 '22 12:09

piRSquared
