Say I have a dataframe of test scores of students, where each student studies different subjects. Each student can take the test for each subject multiple times, and only the highest score (out of 100) will be retained. For instance, say I have a dataframe of all test records:
| student_name | subject | test_number | score |
|--------------|---------|-------------|-------|
| sarah | maths | test1 | 78 |
| sarah | maths | test2 | 71 |
| sarah | maths | test3 | 83 |
| sarah | physics | test1 | 91 |
| sarah | physics | test2 | 97 |
| sarah | history | test1 | 83 |
| sarah | history | test2 | 87 |
| joan | maths | test1 | 83 |
| joan | maths | test2 | 88 |
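For reproducibility, the frame above can be built as a literal (values copied straight from the table):

```python
import pandas as pd

df = pd.DataFrame({
    "student_name": ["sarah"] * 7 + ["joan"] * 2,
    "subject": ["maths"] * 3 + ["physics"] * 2 + ["history"] * 2 + ["maths"] * 2,
    "test_number": ["test1", "test2", "test3", "test1", "test2",
                    "test1", "test2", "test1", "test2"],
    "score": [78, 71, 83, 91, 97, 83, 87, 83, 88],
})
```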
(1) How do I keep only the test record (row) with the maximum score for each student and subject? That is,
| student_name | subject | test_number | score |
|--------------|---------|-------------|-------|
| sarah | maths | test3 | 83 |
| sarah | physics | test2 | 97 |
| sarah | history | test2 | 87 |
| joan | maths | test2 | 88 |
(2) How would I keep the average score of all tests taken by the same student for the same subject? That is:
| student_name | subject | test_number | ave_score |
|--------------|---------|-------------|-----------|
| sarah | maths | na | 77.333 |
| sarah | physics | na | 94 |
| sarah | history | na | 85 |
| joan | maths | na | 85.5 |
I've tried various combinations of `df.sort_values()` and `df.drop_duplicates(subset=..., keep=...)`, to no avail.
**Actual Data**
| query | target | pct-similarity | p-val | aln_length | bit-score |
|-------|----------|----------------|-------|------------|-----------|
| EV239 | B/Fw6/623 | 99.23 | 0.966 | 832 | 356 |
| EV239 | B/Fw6/623 | 97.34 | 0.982 | 1022 | 739 |
| EV239 | MMS-alpha | 92.23 | 0.997 | 838 | 384 |
| EV239 | MMS-alpha | 93.49 | 0.993 | 1402 | 829 |
| EV380 | B/Fw6/623 | 94.32 | 0.951 | 324 | 423 |
| EV380 | B/Fw6/623 | 95.27 | 0.932 | 1245 | 938 |
| EV380 | MMS-alpha | 99.23 | 0.927 | 723 | 522 |
| EV380 | MMS-alpha | 99.15 | 0.903 | 948 | 1092 |
After the aggregation function is applied, only the `pct-similarity` column will be of interest.

(1) Drop duplicate query+target rows by choosing the maximum `aln_length`. Retain the `pct-similarity` value that belongs to the row with the maximum `aln_length`.

(2) Aggregate duplicate query+target rows by choosing the row with the maximum `aln_length`, and computing the average `pct-similarity` for that set of duplicate rows. The other numerical columns aren't necessary and will be dropped eventually, so I really don't care what aggregation function (max or mean) is applied to them.
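For concreteness, here is a hedged sketch of what both requirements could look like in pandas; the frame literal just mirrors the table above (`p-val` and `bit-score` omitted for brevity), and the column names are taken as-is:

```python
import pandas as pd

# illustrative reconstruction of the actual data shown above
df = pd.DataFrame({
    "query":          ["EV239"] * 4 + ["EV380"] * 4,
    "target":         ["B/Fw6/623", "B/Fw6/623", "MMS-alpha", "MMS-alpha"] * 2,
    "pct-similarity": [99.23, 97.34, 92.23, 93.49, 94.32, 95.27, 99.23, 99.15],
    "aln_length":     [832, 1022, 838, 1402, 324, 1245, 723, 948],
})

# (1) keep the whole row with the maximum aln_length per query/target pair,
#     which retains that row's pct-similarity
best = df.sort_values("aln_length").drop_duplicates(["query", "target"], keep="last")

# (2) maximum aln_length together with the mean pct-similarity per pair
agg = df.groupby(["query", "target"], as_index=False).agg(
    aln_length=("aln_length", "max"),
    pct_similarity=("pct-similarity", "mean"),
)
```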
Just apply `max()` to each group of student/subject:

```python
df.groupby(["student_name","subject"], as_index=False).max()
```

```
  student_name  subject test_number  score
0         joan    maths       test2     88
1        sarah  history       test2     87
2        sarah    maths       test3     83
3        sarah  physics       test2     97
```

Note that `max()` is computed column by column within each group, so `test_number` is the lexicographically largest label, not necessarily the test that produced the top score (in this data the two happen to coincide). When the whole winning row must be kept, use the `drop_duplicates` approach below.
For the average, use `mean()` instead:

```python
df.groupby(["student_name","subject"], as_index=False).mean()
```

```
  student_name  subject      score
0         joan    maths  85.500000
1        sarah  history  85.000000
2        sarah    maths  77.333333
3        sarah  physics  94.000000
```
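Note that recent pandas versions (2.0 and later) no longer silently drop non-numeric columns such as `test_number` here; `mean()` raises a `TypeError` instead. In that case, restrict the aggregation explicitly:

```python
# select the numeric column before aggregating ...
df.groupby(["student_name", "subject"], as_index=False)["score"].mean()

# ... or tell mean() to skip non-numeric columns
df.groupby(["student_name", "subject"], as_index=False).mean(numeric_only=True)
```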
Most likely `describe` can help here:

```python
df.groupby(["student_name","subject"]).score.describe()
```

```
                      count       mean       std   min    25%   50%    75%   max
student_name subject
joan         maths      2.0  85.500000  3.535534  83.0  84.25  85.5  86.75  88.0
sarah        history    2.0  85.000000  2.828427  83.0  84.00  85.0  86.00  87.0
             maths      3.0  77.333333  6.027714  71.0  74.50  78.0  80.50  83.0
             physics    2.0  94.000000  4.242641  91.0  92.50  94.0  95.50  97.0
```
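If only a couple of those statistics are needed, `agg` with a list of function names keeps the output narrower than `describe`:

```python
df.groupby(["student_name", "subject"])["score"].agg(["mean", "max"])
```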
And with `drop_duplicates`:

```python
df.sort_values('score').drop_duplicates(["student_name","subject"], keep='last')
```

```
  student_name  subject test_number  score
2        sarah    maths       test3     83
6        sarah  history       test2     87
8         joan    maths       test2     88
4        sarah  physics       test2     97
```

Sorting by `score` first and keeping the last duplicate guarantees that the retained row is the one with the highest score, so all of its other columns stay intact.
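A variant that avoids the sort is to look up the index of the best row per group with `idxmax` (a sketch on the same `df`):

```python
# index labels of the highest-scoring row in each group ...
best_idx = df.groupby(["student_name", "subject"])["score"].idxmax()
# ... then pull those whole rows, other columns intact
df.loc[best_idx]
```

Either way, because the whole row is kept, the same pattern answers part (1) of the actual data: group by `["query", "target"]`, rank on `aln_length`, and the matching `pct-similarity` comes along for free.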
For the `mean` values, with `reindex` to restore the original column layout:

```python
df.groupby(["student_name","subject"], as_index=False).mean().reindex(columns=df.columns)
```

```
  student_name  subject test_number      score
0         joan    maths         NaN  85.500000
1        sarah  history         NaN  85.000000
2        sarah    maths         NaN  77.333333
3        sarah  physics         NaN  94.000000
```