
Select highest values in a dataframe by group

Tags: dataframe, r
I have the following data frame:

dat <- data.frame(Cases = c("Student3","Student3","Student3","Student1","Student1",
"Student2","Student2","Student2","Student4"), Class = rep("Math", 9),
Scores = c(9,5,2,7,3,8,5,1,7), stringsAsFactors = FALSE)


> dat
   Cases    Class   Scores
1 Student3  Math      9
2 Student3  Math      5
3 Student3  Math      2
4 Student1  Math      7
5 Student1  Math      3
6 Student2  Math      8
7 Student2  Math      5
8 Student2  Math      1
9 Student4  Math      7

I also have a second data frame, d, that lists each student once:

d <- data.frame(Cases = c("Student3", "Student1",
"Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = FALSE)

    Cases  Class
1 Student3  Math
2 Student1  Math
3 Student2  Math
4 Student4  Math

Using these two, I want to extract the highest score for each student, so my output would look like this:

> dat_output
    Cases  Class   Scores
1 Student3  Math      9
2 Student1  Math      7
3 Student2  Math      8
4 Student4  Math      7

I tried merge(), but it does not return only the highest score for each student.

Cahidora asked Aug 17 '18 07:08




2 Answers

We can use sapply over each value of Cases in d, subset dat to the rows for that student, and take the maximum score.

sapply(d$Cases, function(x) max(dat$Scores[dat$Cases %in% x]))

#Student3 Student1 Student2 Student4 
#       9        7        8        7 

To get the result as a data frame:

transform(d, Scores = sapply(d$Cases, function(x) 
                     max(dat$Scores[dat$Cases %in% x])))

#    Cases Class Scores
# Student3  Math      9 
# Student1  Math      7
# Student2  Math      8
# Student4  Math      7
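
A variant of the same lookup, as a minimal sketch rather than a required change: vapply() declares up front that each call must return one numeric value, and USE.NAMES = FALSE keeps the student names from being attached to the result. The name dat_output is taken from the question; the data is the sample data posted there.

```r
# Sample data from the question.
dat <- data.frame(
  Cases  = c("Student3", "Student3", "Student3", "Student1", "Student1",
             "Student2", "Student2", "Student2", "Student4"),
  Class  = rep("Math", 9),
  Scores = c(9, 5, 2, 7, 3, 8, 5, 1, 7),
  stringsAsFactors = FALSE
)
d <- data.frame(
  Cases = c("Student3", "Student1", "Student2", "Student4"),
  Class = rep("Math", 4),
  stringsAsFactors = FALSE
)

# Copy d so the lookup table itself is left untouched.
dat_output <- d

# For each student in d, take the maximum of that student's scores in dat.
# numeric(1) makes vapply() fail loudly if a call returns anything else.
dat_output$Scores <- vapply(
  d$Cases,
  function(x) max(dat$Scores[dat$Cases == x]),
  numeric(1),
  USE.NAMES = FALSE
)

dat_output
```

The scores come back in the order of d (9, 7, 8, 7). If a student in d had no rows in dat, max() would return -Inf with a warning, so that case would need separate handling.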

Note: I have assumed your d to be:

d <- data.frame(Cases = c("Student3", "Student1",
      "Student2", "Student4"), Class = rep("Math", 4), stringsAsFactors = FALSE)
Ronak Shah answered Sep 23 '22 19:09


If I am correct, you don't need d, since it contains no information that is not already in dat.

You can just do:

dat_output <- aggregate(Scores ~ Cases, dat, max)
dat_output

     Cases Scores
1 Student1      7
2 Student2      8
3 Student3      9
4 Student4      7
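
The result above drops the Class column and sorts the students, while the desired output in the question keeps Class and follows the order of d. A minimal sketch of one way to get both, assuming each student appears in only one class (as in the sample data); the intermediate name best is introduced here for illustration only:

```r
# Sample data from the question.
dat <- data.frame(
  Cases  = c("Student3", "Student3", "Student3", "Student1", "Student1",
             "Student2", "Student2", "Student2", "Student4"),
  Class  = rep("Math", 9),
  Scores = c(9, 5, 2, 7, 3, 8, 5, 1, 7),
  stringsAsFactors = FALSE
)
d <- data.frame(
  Cases = c("Student3", "Student1", "Student2", "Student4"),
  Class = rep("Math", 4),
  stringsAsFactors = FALSE
)

# Group by Class as well as Cases so the Class column survives.
best <- aggregate(Scores ~ Cases + Class, data = dat, FUN = max)

# aggregate() returns the groups sorted, so look up each student of d
# in best to restore the order used in d.
# Assumption: one class per student, so matching on Cases alone is enough.
dat_output <- best[match(d$Cases, best$Cases), ]
rownames(dat_output) <- NULL

dat_output
```

Here d is used only to fix the row order; the scores themselves still come from dat alone.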
Lennyy answered Sep 26 '22 19:09