
Function for duplicate rows

Tags: r

I have a data frame like the one below:

> df
     pat_id disease
[1,] "pat1" "dis1" 
[2,] "pat1" "dis1" 
[3,] "pat2" "dis0" 
[4,] "pat2" "dis5" 
[5,] "pat3" "dis2" 
[6,] "pat3" "dis2" 

How can I write a function that adds a third variable indicating whether the disease value is the same for all rows with the same pat_id, like below?

> df
     pat_id disease var3
[1,] "pat1" "dis1"  "1" 
[2,] "pat1" "dis1"  "1" 
[3,] "pat2" "dis0"  "0" 
[4,] "pat2" "dis5"  "0" 
[5,] "pat3" "dis2"  "1" 
[6,] "pat3" "dis2"  "1" 

trillian asked Nov 17 '25 12:11


1 Answer

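Use ave() to apply a function within each pat_id group, and wrap any(duplicated()) in as.integer() to get a 1/0 flag. Then attach the result with cbind(). Because df is a character matrix, var3 is coerced to character as well, so I would recommend using a data frame instead of a matrix here.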

cbind(
    df, 
    var3 = ave(df[,2], df[,1], FUN = function(x) as.integer(any(duplicated(x))))
)
#      pat_id disease var3
# [1,] "pat1" "dis1"  "1" 
# [2,] "pat1" "dis1"  "1" 
# [3,] "pat2" "dis0"  "0" 
# [4,] "pat2" "dis5"  "0" 
# [5,] "pat3" "dis2"  "1" 
# [6,] "pat3" "dis2"  "1" 
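
To illustrate the data frame suggestion, here is a minimal sketch of the same ave() approach on a data frame. The object df2 is a hypothetical stand-in built from the values in the question, not the asker's actual object. ave() returns a vector of the same type as its first argument, so the flag comes back as character and is converted with an outer as.integer().

```r
# Hypothetical data frame mirroring the matrix in the question
df2 <- data.frame(
  pat_id  = c("pat1", "pat1", "pat2", "pat2", "pat3", "pat3"),
  disease = c("dis1", "dis1", "dis0", "dis5", "dis2", "dis2"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each pat_id group and recycles the result
# across the group's rows; the result is character because disease is
# character, so convert it to integer afterwards
df2$var3 <- as.integer(
  ave(df2$disease, df2$pat_id,
      FUN = function(x) as.integer(any(duplicated(x))))
)

str(df2)
```

With this layout, pat_id and disease stay character while var3 is stored as an integer column.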

For larger data, I would recommend converting to a data.table. The syntax is a bit nicer too, and it will likely be faster.

library(data.table)
dt <- as.data.table(df)
dt[, var3 := if(any(duplicated(disease))) 1L else 0L, by = pat_id]

which gives

   pat_id disease var3
1:   pat1    dis1    1
2:   pat1    dis1    1
3:   pat2    dis0    0
4:   pat2    dis5    0
5:   pat3    dis2    1
6:   pat3    dis2    1

where column classes will be more appropriate (char, char, int). Or you could use as.integer(any(duplicated(disease))) instead of if/else.
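
One caveat, assuming a pat_id can have more than two rows (the example only shows pairs): any(duplicated(x)) returns 1 as soon as one value repeats, even if another row in the group differs, and returns 0 for a group with a single row. If "the same" should mean that every row in the group agrees, a stricter check may fit better. The helper name all_same below is hypothetical, used only for illustration.

```r
# Hypothetical helper: 1 if every value in x is identical, 0 otherwise
all_same <- function(x) as.integer(length(unique(x)) == 1L)

# A three-row group where only two of the values match
x <- c("dis1", "dis1", "dis3")

as.integer(any(duplicated(x)))  # 1: a repeated value exists
all_same(x)                     # 0: not all values agree
```

Either helper can be passed as FUN to ave() or used in the data.table by-group expression; which one is right depends on how groups of three or more rows, and single-row groups, should be treated.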

Rich Scriven answered Nov 19 '25 08:11


