
How to remove duplicates in R

Tags:

r

I have a very large data set that looks like the one below:

df <- data.frame(school=c("a", "a", "a", "b","b","c","c","c"), year=c(3,3,1,4,2,4,3,1), GPA=c(4,4,4,3,3,3,2,2))

school year GPA
  a    3   4
  a    3   4
  a    1   4
  b    4   3
  b    2   3
  c    4   3
  c    3   2
  c    1   2

and I want it to look like:

school year GPA
 a    3   4
 a    3   4
 b    4   3
 c    4   3

Basically, for each school I want the student(s) in its top year, regardless of GPA.

I have tried:

new_df <- df[!duplicated(paste(df[,1],df[,2])),]

but this gives me the unique combinations of school and year, while the one below gives me one row per unique school:

new_df2 <- df[!duplicated(df$school),]

user1489597 asked Dec 21 '22 17:12


1 Answer

Using the plyr library:

library(plyr)
ddply(df,.(school),function(x){x[x$year==max(x$year),]})
> ddply(df,.(school),function(x){x[x$year==max(x$year),]})
  school year GPA
1      a    3   4
2      a    3   4
3      b    4   3
4      c    4   3

or in base R:

test<-lapply(split(df,df$school),function(x){x[x$year==max(x$year),]})
out<-do.call(rbind,test)
> out
    school year GPA
a.1      a    3   4
a.2      a    3   4
b        b    4   3
c        c    4   3

Explanation: split splits the data frame into a list of data frames, one per school.

dat<-split(df,df$school)

> dat
$a
  school year GPA
1      a    3   4
2      a    3   4
3      a    1   4

$b
  school year GPA
4      b    4   3
5      b    2   3

$c
  school year GPA
6      c    4   3
7      c    3   2
8      c    1   2

For each school we want the rows in the top year, including ties:

dum.fun<-function(x){x[x$year==max(x$year),]}

> dum.fun(dat$a)
  school year GPA
1      a    3   4
2      a    3   4
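
As a small sketch of why the comparison with max() matters here: it keeps every row that ties for the top year. The helper names top_all and top_first below are made up for illustration, and the which.max() variant is not part of the answer; it is shown only for contrast.

```r
# Same sample data as in the question.
df <- data.frame(school = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 year   = c(3, 3, 1, 4, 2, 4, 3, 1),
                 GPA    = c(4, 4, 4, 3, 3, 3, 2, 2))
dat <- split(df, df$school)

# Comparing against max() keeps every row that ties for the top year.
top_all <- function(x) x[x$year == max(x$year), ]

# Hypothetical variant for contrast: which.max() returns only the
# position of the first maximum, so ties are dropped.
top_first <- function(x) x[which.max(x$year), ]

nrow(top_all(dat$a))    # both year-3 rows of school a
nrow(top_first(dat$a))  # only the first of them
```

If only one student per school were wanted, the which.max() form would be the one to reach for; the question asks for all students in the top year, so the == max() form fits.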

lapply applies a function to each member of a list and returns a list of the results:

> lapply(split(df,df$school),function(x){x[x$year==max(x$year),]})
$a
  school year GPA
1      a    3   4
2      a    3   4

$b
  school year GPA
4      b    4   3

$c
  school year GPA
6      c    4   3

This is what we want, but in list form. We need to bind the members of the list together into one data frame. We do this with do.call, which calls rbind once with all the members of the list as its arguments.
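
To make the do.call step concrete, here is a minimal sketch that spells out the equivalent rbind call by hand. The name out2 and the row-name reset at the end are additions for illustration, not part of the answer.

```r
# Same sample data as in the question.
df <- data.frame(school = c("a", "a", "a", "b", "b", "c", "c", "c"),
                 year   = c(3, 3, 1, 4, 2, 4, 3, 1),
                 GPA    = c(4, 4, 4, 3, 3, 3, 2, 2))
test <- lapply(split(df, df$school), function(x) x[x$year == max(x$year), ])

# do.call() passes the list elements as separate arguments to a single
# rbind() call, so both lines below produce the same rows. The row names
# may differ, because do.call() also passes the list names a, b, c.
out  <- do.call(rbind, test)
out2 <- rbind(test$a, test$b, test$c)

# Optional: replace the generated row names (a.1, a.2, b, c) with 1..n.
rownames(out) <- NULL
out
```

Writing the call out by hand only works when the schools are known in advance; do.call handles any number of groups.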

shhhhimhuntingrabbits answered Jan 25 '23 07:01