I think this question is best visualized as follows. Given a dataframe:
val_1 true_val ID label
-0.0127894447 0.0 1 A
0.9604560385 1.0 2 A
0.0001271985 0.0 3 A
0.0007419337 0.0 3 B
0.3420448566 0.0 2 B
0.1322384726 1.0 4 B
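For anyone who wants to reproduce this, here is a minimal sketch that builds the frame above. The variable name `df` is just my choice for illustration.

```python
import pandas as pd

# Rebuild the example data shown above, row for row
df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})
print(df)
```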
So what I want to get is:
label ID_val_1_second_highest ID_true_val_highest
A 3 2
B 4 4
For every label, I want the ID of the row with the second highest val_1 and the ID of the row with the highest true_val (which is always the one with 1.0), and then return both corresponding IDs.
Anyone have an idea how to do this? I tried:
result_at_one = result.set_index('ID').groupby('label').idxmax()
This works for giving me the highest value for both, but I only want the highest value for the true label while getting the second / third etc. highest value for the val_1 variable.
Someone linked this as answer: Pandas: Get N largest values and insert NaN values if there are no elements
However, with that approach I still need to group by label, so the output would become:
label true_id top1_id_val_1 top2_id_val_1 top3_id_val_1
A 2 2 3 1
B 4 2 4 3
Does anyone know how to do this?
The top 5 maximum values for each group can also be found as part of the groupby itself. The helpful function here is nlargest(). The article "How to calculate Top 5 max values by Group in Pandas Python" explains this with an example.
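As a rough sketch of nlargest() inside a groupby, using the question's data (rebuilt here as `df`, my own naming) and n=2 instead of 5 because each label only has three rows:

```python
import pandas as pd

df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

# nlargest is applied to each label separately; the result is a Series
# indexed by (label, original row index)
top2 = df.groupby("label")["val_1"].nlargest(2)

# Map the original row positions back to their IDs
top2_rows = top2.index.get_level_values(1)
top2_ids = df.loc[top2_rows, ["label", "ID"]]
print(top2_ids)
```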
To count the unique value combinations across multiple columns, use pandas DataFrame.drop_duplicates(), which drops duplicate rows from a DataFrame. This eliminates duplicates and returns a DataFrame with only unique rows.
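A small sketch of that, again on the question's data. The choice of the two columns here is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

# Keep one row per distinct (true_val, label) combination
unique_pairs = df[["true_val", "label"]].drop_duplicates()

# The number of remaining rows is the count of unique combinations
n_unique = len(unique_pairs)
print(unique_pairs)
print(n_unique)
```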
Pandas DataFrame: ge() function. The ge() function returns the element-wise "greater than or equal to" comparison of a DataFrame and other. It is equivalent to the >= operator, with support for choosing the axis (rows or columns) and level for comparison.
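For illustration only, a sketch comparing ge() with the plain >= operator. The threshold 0.1 is an arbitrary value I picked.

```python
import pandas as pd

df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

# Element-wise comparison of every cell against a scalar
mask_method = df.ge(0.1)

# The operator form gives the same boolean frame
mask_operator = df >= 0.1
print(mask_method)
```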
pandas.DataFrame.head(): In Python's pandas module, the DataFrame class provides a head() function to fetch the top rows of a DataFrame, i.e. it returns the first n rows.
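head() also exists on groupby objects, so sorting first and then taking the head of each group is another way to get the top rows per label. A sketch on the question's data, with n=2 chosen for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

# Plain head(): the first 3 rows of the frame, in their existing order
first_rows = df.head(3)

# Sort by val_1 descending, then keep the first 2 rows of each label
top2_per_label = (df.sort_values("val_1", ascending=False)
                    .groupby("label")
                    .head(2))
print(top2_per_label)
```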
You can use groupby with a custom apply function to achieve your desired result.
def sorted_maximums(group, nlargest, upto=False):
    # Get the IDs of the rows with the largest val_1 in the current group
    largest_ids = group.nlargest(nlargest, "val_1")["ID"]
    index = ["val_1_ID_rank_{}".format(i) for i in range(1, nlargest+1)]

    # Drop data if we're only interested in the nlargest value
    # and none of the IDs leading up to it
    if upto is False:
        largest_ids = largest_ids.iloc[nlargest-1:]
        index = index[-1:]

    # Get the ID at the max "true_val"
    true_val_max = group.at[group["true_val"].idxmax(), "ID"]
    index += ["ID_true_val_highest"]

    # Combine our IDs based on val_1 and our ID based on true_val
    data = [*largest_ids, true_val_max]
    return pd.Series(data, index=index)
df.groupby("label").apply(sorted_maximums, nlargest=2, upto=False).reset_index()
label val_1_ID_rank_2 ID_true_val_highest
0 A 3 2
1 B 4 4
df.groupby("label").apply(sorted_maximums, nlargest=2, upto=True).reset_index()
label val_1_ID_rank_1 val_1_ID_rank_2 ID_true_val_highest
0 A 2 3 2
1 B 2 4 4
Since I wasn't sure from your question whether you wanted only the 2nd largest ID (by val_1) or the 1st, 2nd, AND 3rd highest IDs by val_1 in one go, I put in both methods. Setting upto=True does the latter, while upto=False does the former and returns only the single ID at the requested rank (the 1st, 2nd, OR 3rd highest by val_1).
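If the wide layout from the question (true_id plus top1/top2/top3 columns) is the goal, here is an alternative sketch that avoids apply by numbering the rows within each label and pivoting. This is a different technique from the function above, and the names `pos`, `wide` and `result` are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

# Number the rows of each label from 1 (highest val_1) downwards
ranked = df.sort_values("val_1", ascending=False)
ranked = ranked.assign(pos=ranked.groupby("label").cumcount() + 1)

# One row per label, one column per position, holding the ID
wide = ranked.pivot(index="label", columns="pos", values="ID")
wide.columns = [f"top{p}_id_val_1" for p in wide.columns]

# ID of the row with the highest true_val per label
true_id = (df.loc[df.groupby("label")["true_val"].idxmax()]
             .set_index("label")["ID"]
             .rename("true_id"))

result = pd.concat([true_id, wide], axis=1).reset_index()
print(result)
```

Note that pivot fills missing positions with NaN when a label has fewer rows than the others, which also turns those columns into floats.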
You can break it into stages:
# grouping is relatively inexpensive:
grouping = df.groupby("label")
# get the ID at the second highest val_1 per label:
# sort descending, then keep the row at position 1 within each group
ranked = df.sort_values("val_1", ascending=False)
id_val = (ranked[ranked.groupby("label").cumcount() == 1]
          .set_index("label")["ID"]
          .sort_index()
          .rename("ID_val_1_second_highest"))
# get highest true val
# you could also do df.true_val.eq(grouping.true_val.transform('max'))
# since we know the highest is 1, I just jumped into it
true_val = (df.loc[df.true_val == 1, ["ID", "label"]]
            .set_index("label")
            .rename(columns={"ID": "ID_true_val_highest"}))
# merge to get output:
pd.concat([id_val, true_val], axis=1).reset_index()
label ID_val_1_second_highest ID_true_val_highest
0 A 3 2
1 B 4 4
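The transform('max') alternative mentioned in the code comment above could look roughly like this. It does not rely on the maximum being exactly 1; the names `mask` and `true_val_ids` are mine.

```python
import pandas as pd

df = pd.DataFrame({
    "val_1": [-0.0127894447, 0.9604560385, 0.0001271985,
              0.0007419337, 0.3420448566, 0.1322384726],
    "true_val": [0.0, 1.0, 0.0, 0.0, 0.0, 1.0],
    "ID": [1, 2, 3, 3, 2, 4],
    "label": ["A", "A", "A", "B", "B", "B"],
})

# transform broadcasts each label's maximum back onto its rows,
# so the comparison marks the rows that hold the per-label maximum
mask = df["true_val"].eq(df.groupby("label")["true_val"].transform("max"))

true_val_ids = (df.loc[mask, ["label", "ID"]]
                  .set_index("label")["ID"]
                  .rename("ID_true_val_highest"))
print(true_val_ids)
```

If several rows in a label tie for the maximum, this keeps all of them, so the result can have more than one row per label.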
After trying out a few methods (namely sorting + ranking + melting, pivoting, and groupby with custom functions), I've come to the conclusion that an expanded groupby, looping over the groups explicitly, is your best solution. It is best suited to specialized cases like this one:
records = []
# Iterate through your groupby objects (true_val is needed below, so keep it)
for group_label, group_df in df[["label", "ID", "val_1", "true_val"]].groupby("label"):
    # rank val_1 within the group, 1 = highest
    rank_idx = group_df["val_1"].rank(ascending=False, method="first")
    # extract individual attributes
    ID_true_val_highest = group_df.loc[group_df["true_val"].idxmax(), "ID"]
    ID_val_1_second_highest = group_df.loc[rank_idx[rank_idx == 2].index[0], "ID"]
    # store your observations
    rec = {
        "label": group_label,
        "ID_true_val_highest": ID_true_val_highest,
        "ID_val_1_second_highest": ID_val_1_second_highest,
    }
    records.append(rec)
# make into a dataframe
pd.DataFrame.from_records(records)
label ID_true_val_highest ID_val_1_second_highest
0 A 2 3
1 B 4 4