How to get second highest value in a pandas column for a certain ID?

Tags:

pandas

I think this question is best illustrated with an example. Given a dataframe:

        val_1  true_val  ID label
-0.0127894447       0.0   1     A
 0.9604560385       1.0   2     A
 0.0001271985       0.0   3     A
 0.0007419337       0.0   3     B
 0.3420448566       0.0   2     B
 0.1322384726       1.0   4     B
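
To reproduce the example, the frame above can be built as follows. The variable name `df` is an assumption here, chosen to match the answers below (the attempt further down calls the same frame `result`).

```python
import pandas as pd

# Sample data copied from the table above; `df` is an assumed name.
df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

print(df)
```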

So what I want to get is:

label  ID_val_1_second_highest  ID_true_val_highest
A                            3                    2
B                            4                    4

For every label, I want the ID with the second-highest val_1 and the ID with the highest true_val (which is always the row where true_val is 1.0), and then return both IDs.

Does anyone have an idea how to do this? I tried:

result_at_one = result.set_index('ID').groupby('label').idxmax()

This gives me the ID of the highest value for both columns, but I only want the highest for true_val, while getting the ID of the second, third, etc. highest value for val_1.

Someone linked this as an answer: Pandas: Get N largest values and insert NaN values if there are no elements

However, with that approach I still need to group by label. In that case the output would become:

label  true_id  top1_id_val_1  top2_id_val_1  top3_id_val_1
A            2              2              3              1
B            4              2              4              3

Does anyone know how to do this?

asked Sep 15 '20 16:09

3 Answers

You can use groupby with a custom apply function to achieve your desired result.

import pandas as pd

def sorted_maximums(group, nlargest, upto=False):
    # Get the largest IDs in the current group
    largest_ids = group.nlargest(nlargest, "val_1")["ID"]
    index = ["val_1_ID_rank_{}".format(i) for i in range(1, nlargest+1)]
    
    # Drop data if we're only interested in the nlargest value
    #  and none of the IDs leading up to it
    if upto is False:
        largest_ids = largest_ids.iloc[nlargest-1:]
        index = index[-1:]
        
    # Get the ID at the max "true_val"
    true_val_max = group.at[group["true_val"].idxmax(), "ID"]
    index += ["ID_true_val_highest"]

    # Combine our IDs based on val_1 and our ID based on true_val
    data = [*largest_ids, true_val_max]
    return pd.Series(data, index=index)
    
df.groupby("label").apply(sorted_maximums, nlargest=2, upto=False).reset_index()

  label  val_1_ID_rank_2  ID_true_val_highest
0  A     3                2                  
1  B     4                4

df.groupby("label").apply(sorted_maximums, nlargest=2, upto=True).reset_index()

  label  val_1_ID_rank_1  val_1_ID_rank_2  ID_true_val_highest
0  A     2                3                2                  
1  B     2                4                4

Since I wasn't sure from your question whether you wanted only the ID with the 2nd-largest val_1, or the IDs with the 1st, 2nd, and 3rd largest val_1 in one go, I included both. With upto=False you get only the ID at the rank given by nlargest (the 1st, 2nd, or 3rd highest val_1). With upto=True you get every ID from rank 1 up to and including that rank.
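
As an alternative to apply, the top-N layout from the question can also be sketched by sorting once and pivoting each row's position within its label into columns. This is a minimal sketch, not part of the answer above: the helper name `top_n_ids`, the intermediate column `pos`, and the sample frame are assumptions for illustration, and the output column names are copied from the question.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed name.
df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

def top_n_ids(frame, n):
    """Hypothetical helper: IDs of the n largest val_1 per label, one column per rank."""
    # Sort so that position within each label equals the val_1 rank (1 = highest).
    ranked = frame.sort_values(["label", "val_1"], ascending=[True, False])
    ranked = ranked.assign(pos=ranked.groupby("label").cumcount() + 1)
    ranked = ranked[ranked["pos"] <= n]

    # One row per label, one column per rank; ranks missing in every group become NaN.
    wide = ranked.pivot(index="label", columns="pos", values="ID")
    wide = wide.reindex(columns=range(1, n + 1))
    wide.columns = [f"top{i}_id_val_1" for i in wide.columns]

    # ID of the row holding the highest true_val in each label (first one on ties).
    best_rows = frame.loc[frame.groupby("label")["true_val"].idxmax()]
    wide.insert(0, "true_id", best_rows.set_index("label")["ID"])
    return wide.reset_index()

print(top_n_ids(df, 3))
```

With n=3 this should reproduce the second table in the question. A label with fewer than n rows gets NaN in the missing rank columns, which also turns those columns into floats.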

answered Oct 17 '22 11:10

Cameron Riddell

You can break it into stages:

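# sort once so that position within each label reflects the val_1 ranking
ranked = df.sort_values(["label", "val_1"], ascending=[True, False])

# grouping is relatively inexpensive:
grouping = ranked.groupby("label")

# get the ID with the second-highest val_1 (position 1 after the descending sort);
# taking the last row per label with nth(-1) only matches on this sample by coincidence
id_val = (ranked.loc[grouping.cumcount() == 1]
          .set_index("label")["ID"]
          .rename("ID_val_1_second_highest"))

# get the ID with the highest true_val
# you could also do df.true_val.eq(grouping.true_val.transform('max'))
# since we know the highest is 1, I just jumped into it
true_val = (df.loc[df.true_val == 1, ["ID", "label"]]
            .set_index("label")
            .rename(columns={"ID": "ID_true_val_highest"}))

# merge to get output:
pd.concat([id_val, true_val], axis=1).reset_index()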

    label   ID_val_1_second_highest ID_true_val_highest
0       A      3                        2
1       B      4                        4
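
The transform('max') variant mentioned in the code comment could look like the sketch below. It avoids hard-coding 1 as the maximum. The sample frame and the name `is_max` are assumptions for illustration.

```python
import pandas as pd

# Sample data from the question; `df` is an assumed name.
df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

grouping = df.groupby("label")

# Broadcast each label's maximum true_val back onto its rows,
# then flag the rows that reach that maximum.
is_max = df["true_val"].eq(grouping["true_val"].transform("max"))

true_val = (df.loc[is_max, ["ID", "label"]]
            .set_index("label")
            .rename(columns={"ID": "ID_true_val_highest"}))

print(true_val)
```

If several rows in a label tie for the maximum, all of them are kept, so the result can contain more than one row per label.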

answered Oct 17 '22 11:10

sammywemmy

After trying a few methods (sorting + ranking + melting, pivoting, and groupby with custom functions), I've concluded that an expanded groupby, meaning an explicit loop over the groups, is your best solution. It is best suited to specialized cases like this one:

records = []

# Iterate through your groupby objects
for group_label, group_df in df[["label", "ID", "val_1", "true_val"]].groupby("label"):
    # rank val_1 within the group: 1 = highest, 2 = second highest, ...
    rank = group_df["val_1"].rank(ascending=False, method="first")
    # extract individual attributes
    ID_true_val_highest = group_df.loc[group_df["true_val"].idxmax(), "ID"]
    # raises IndexError if the group has fewer than two rows
    ID_val_1_second_highest = group_df.loc[rank == 2, "ID"].iloc[0]

    # store your observations
    rec = {
        "label": group_label,
        "ID_true_val_highest": ID_true_val_highest,
        "ID_val_1_second_highest": ID_val_1_second_highest,
    }
    records.append(rec)

# make into a dataframe
pd.DataFrame.from_records(records)

  label  ID_true_val_highest  ID_val_1_second_highest
0     A                    2                        3
1     B                    4                        4

answered Oct 17 '22 13:10

Yaakov Bressler
