
Identify and mark duplicate rows in R

Tags:

dataframe

r

I would like to identify and mark duplicate rows based on two columns, and give each set of duplicates a unique identifier so that I know not just that a row is a duplicate, but which row it duplicates. My data frame (below) contains some item pairs (columns fit and sit) that appear twice, with the items in swapped order, and other pairs that appear only once. Although the item pairs are duplicated, the information in each row is unique: for example, one row has a value in value1 but not in value2 or value3, while its 'duplicate' row has values in value2 and value3 but not in value1.

current dataframe

     value1 value2 value3 fit   sit  
[1,] "1"    NA     NA     "it1" "it2"
[2,] NA     "3"    "2"    "it2" "it1"
[3,] "2"    "3"    "4"    "it3" "it4"
[4,] NA     NA     NA     "it4" "it3"
[5,] "5"    NA     NA     "it5" "it6"
[6,] NA     NA     "2"    "it6" "it5"
[7,] NA     "4"    NA     "it7" "it9"

code to generate example dataframe

value1 <- c(1, NA, 2, NA, 5, NA, NA)
value2 <- c(NA, 3, 3, NA, NA, NA, 4)
value3 <- c(NA, 2, 4, NA, NA, 2, NA)
fit <- c("it1", "it2", "it3", "it4", "it5", "it6", "it7")
sit <- c("it2", "it1", "it4", "it3", "it6", "it5", "it9")
# note: cbind() returns a character matrix here, not a data frame
df.now <- cbind(value1, value2, value3, fit, sit)

What I want is to convert it to a data frame that looks like this:

desired dataframe

     val1 val2 val3 it1   it2  
[1,] "1"  "3"  "2"  "it1" "it2"
[2,] "2"  "3"  "4"  "it3" "it4"
[3,] "5"  NA   "2"  "it5" "it6"
[4,] NA   "4"  NA   "it7" "it9"

I was thinking of doing the following steps:

1. Create new variables from fit and sit holding the lowest and highest item, to identify duplicate pairs.
2. Identify the duplicated item pairs.
3. Use ifelse to select and fill in the unique information.

I know how to do steps 1 and 3, but am stuck on step 2. I think I need more than a TRUE/FALSE duplicate flag: a column with a unique identifier for each item pair, like this (there are 2 extra columns, lit and hit, because of my step 1):

     value1 value2 value3 fit   sit   lit   hit    dup
[1,] "1"    NA     NA     "it1" "it2" "it1" "it2"   1
[2,] NA     "3"    "2"    "it2" "it1" "it1" "it2"   1
[3,] "2"    "3"    "4"    "it3" "it4" "it3" "it4"   2
[4,] NA     NA     NA     "it4" "it3" "it3" "it4"   2
[5,] "5"    NA     NA     "it5" "it6" "it5" "it6"   3
[6,] NA     NA     "2"    "it6" "it5" "it5" "it6"   3
[7,] NA     "4"    NA     "it7" "it9" "it7" "it9"   NA

I am not sure how to do this.

I am looking for either help with step 2 or a better way to solve this than the steps I outlined.

Heather Clark asked Jan 04 '20 14:01



1 Answer

One dplyr option could be:

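library(dplyr)

# df.now in the question is a character matrix (cbind() output); rebuild it
# as a data frame so the value columns stay numeric.
df.now <- data.frame(value1, value2, value3, fit, sit, stringsAsFactors = FALSE)

df.now %>%
 group_by(pair = paste(pmax(fit, sit), pmin(fit, sit), sep = "_")) %>%
 summarise_at(vars(starts_with("value")), ~ ifelse(all(is.na(.)),
                                                   NA,
                                                   first(na.omit(.))))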

  pair    value1 value2 value3
  <chr>    <dbl>  <dbl>  <dbl>
1 it2_it1      1      3      2
2 it4_it3      2      3      4
3 it6_it5      5     NA      2
4 it9_it7     NA      4     NA
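
The same order-independent key can also answer step 2 of the question directly, that is, give every duplicated pair its own identifier. The following is a minimal base R sketch, not part of the original answer. It assumes the example data is rebuilt as a data frame (the question builds a matrix with cbind()), and the column names lit, hit and dup are taken from the question's mock-up.

```r
# Example data from the question, rebuilt as a data frame (an assumption;
# the question itself uses cbind(), which yields a character matrix).
df.now <- data.frame(
  value1 = c(1, NA, 2, NA, 5, NA, NA),
  value2 = c(NA, 3, 3, NA, NA, NA, 4),
  value3 = c(NA, 2, 4, NA, NA, 2, NA),
  fit = c("it1", "it2", "it3", "it4", "it5", "it6", "it7"),
  sit = c("it2", "it1", "it4", "it3", "it6", "it5", "it9"),
  stringsAsFactors = FALSE
)

# Step 1: order-independent pair columns (lowest and highest item).
df.now$lit <- pmin(df.now$fit, df.now$sit)
df.now$hit <- pmax(df.now$fit, df.now$sit)

# Step 2: one key per pair, numbered in order of first appearance.
key <- paste(df.now$lit, df.now$hit, sep = "_")
df.now$dup <- match(key, unique(key))

# Pairs that occur only once are not duplicates, so mark them NA.
is_single <- !(duplicated(key) | duplicated(key, fromLast = TRUE))
df.now$dup[is_single] <- NA

df.now
```

Rows 1 and 2 share one identifier, rows 3 and 4 the next, and so on, while the unpaired last row gets NA, which matches the dup column sketched in the question.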

And if you also need the pairs in individual columns, then with the addition of tidyr you can do:

library(tidyr)

df.now %>%
 group_by(pair = paste(pmax(fit, sit), pmin(fit, sit), sep = "_")) %>%
 summarise_at(vars(starts_with("value")), ~ ifelse(all(is.na(.)),
                                                   NA,
                                                   first(na.omit(.)))) %>%
 # pair is "<higher item>_<lower item>", so the first piece is the higher item
 separate(pair, into = c("hit", "lit"), sep = "_", remove = FALSE)

  pair    hit   lit   value1 value2 value3
  <chr>   <chr> <chr>  <dbl>  <dbl>  <dbl>
1 it2_it1 it2   it1        1      3      2
2 it4_it3 it4   it3        2      3      4
3 it6_it5 it6   it5        5     NA      2
4 it9_it7 it9   it7       NA      4     NA
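
If extra packages are not an option, the collapse can be sketched in base R as well. This is an illustrative alternative rather than the answer's own method. It assumes the data is a data frame, and first_non_na is a hypothetical helper name introduced here. The output columns it1 and it2 follow the question's desired layout, with the lower item first.

```r
# Example data from the question, rebuilt as a data frame (an assumption;
# the question itself uses cbind(), which yields a character matrix).
df.now <- data.frame(
  value1 = c(1, NA, 2, NA, 5, NA, NA),
  value2 = c(NA, 3, 3, NA, NA, NA, 4),
  value3 = c(NA, 2, 4, NA, NA, 2, NA),
  fit = c("it1", "it2", "it3", "it4", "it5", "it6", "it7"),
  sit = c("it2", "it1", "it4", "it3", "it6", "it5", "it9"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first non-missing value of x. Indexing past the end
# of the filtered vector returns NA of the same type when all values are NA.
first_non_na <- function(x) x[!is.na(x)][1]

# Group on the order-independent pair (lower item, higher item) and keep
# the first non-missing value of each value column within each pair.
res <- aggregate(
  df.now[c("value1", "value2", "value3")],
  by = list(it1 = pmin(df.now$fit, df.now$sit),
            it2 = pmax(df.now$fit, df.now$sit)),
  FUN = first_non_na
)

res
```

Grouping on pmin()/pmax() rather than on fit and sit directly is what makes "it1"/"it2" and "it2"/"it1" fall into the same group.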
tmfmnk answered Sep 30 '22 15:09