Identify and mark duplicate rows in R

Tags:

dataframe

r

I would like to identify and mark duplicate rows based on 2 columns. I would like to make a unique identifier for each duplicate so I know not just that the row is a duplicate, but which row it is a duplicate of. I have a dataframe like the one below, with some duplicated item pairs (on fit and sit) and other pairs that are not duplicated. Although the item pairs are duplicated, the information in each row is unique (e.g., one row will have a value for value1 but not for value2 and value3, while the second, 'duplicate' row will have values for value2 and value3 but not for value1).

current dataframe

     value1 value2 value3 fit   sit  
[1,] "1"    NA     NA     "it1" "it2"
[2,] NA     "3"    "2"    "it2" "it1"
[3,] "2"    "3"    "4"    "it3" "it4"
[4,] NA     NA     NA     "it4" "it3"
[5,] "5"    NA     NA     "it5" "it6"
[6,] NA     NA     "2"    "it6" "it5"
[7,] NA     "4"    NA     "it7" "it9"

code to generate example dataframe

value1<-c(1,NA,2,NA,5,NA,NA)
value2<-c(NA,3,3,NA,NA,NA, 4)
value3<-c(NA,2,4,NA,NA,2, NA)
fit<-c("it1","it2","it3","it4", "it5", "it6","it7")
sit<-c("it2","it1","it4","it3", "it6", "it5", "it9")
df.now<-cbind(value1,value2,value3, fit, sit)

What I want is to convert it to a dataframe that looks like this:

desired dataframe

     val1 val2 val3 it1   it2  
[1,] "1"  "3"  "2"  "it1" "it2"
[2,] "2"  "3"  "4"  "it3" "it4"
[3,] "5"  NA   "2"  "it5" "it6"
[4,] NA   "4"  NA   "it7" "it9"

I was thinking of doing the following steps:

1. create new variables from fit and sit holding the lowest and highest item, to identify duplicate pairs
2. identify duplicated item pairs
3. use ifelse to select and fill in the unique information

I know how to do steps 1 and 3, but am stuck on step 2. I think what I need is not just a TRUE/FALSE duplicate flag, but a column with a unique identifier for each item pair, like this (the extra columns lit and hit come from my step 1):

     value1 value2 value3 fit   sit   lit   hit    dup
[1,] "1"    NA     NA     "it1" "it2" "it1" "it2"   1
[2,] NA     "3"    "2"    "it2" "it1" "it1" "it2"   1
[3,] "2"    "3"    "4"    "it3" "it4" "it3" "it4"   2
[4,] NA     NA     NA     "it4" "it3" "it3" "it4"   2
[5,] "5"    NA     NA     "it5" "it6" "it5" "it6"   3
[6,] NA     NA     "2"    "it6" "it5" "it5" "it6"   3
[7,] NA     "4"    NA     "it7" "it9" "it7" "it9"   NA

I am not sure how to do this.

What I am asking for is either help with step 2 or perhaps there is a better way to solve it than the steps I outlined.

asked Jan 04 '20 14:01

Heather Clark

1 Answer

One dplyr option could be:

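library(dplyr)
# Assumes df.now is a data frame, e.g. data.frame(value1, value2, value3, fit, sit);
# cbind() as in the question returns a character matrix instead.
df.now %>%
 # order-independent key: (it1, it2) and (it2, it1) both become "it2_it1"
 group_by(pair = paste(pmax(fit, sit), pmin(fit, sit), sep = "_")) %>%
 # per value column: first non-NA entry in the group, or NA if there is none
 summarise_at(vars(starts_with("value")), ~ ifelse(all(is.na(.)), 
                                                   NA,
                                                   first(na.omit(.))))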

  pair    value1 value2 value3
  <chr>    <dbl>  <dbl>  <dbl>
1 it2_it1      1      3      2
2 it4_it3      2      3      4
3 it6_it5      5     NA      2
4 it9_it7     NA      4     NA
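
The grouping key works because pmin() and pmax() compare the two columns element by element, so both orderings of a pair collapse to the same string. A minimal base R sketch, using a hypothetical three-row subset of the question's data and a made-up variable name `key`:

```r
# Hypothetical subset: rows 1 and 2 are the same pair in opposite order
fit <- c("it1", "it2", "it7")
sit <- c("it2", "it1", "it9")

# Lower item first, higher item second, regardless of the original column
key <- paste(pmin(fit, sit), pmax(fit, sit), sep = "_")
key
# "it1_it2" "it1_it2" "it7_it9"
```

The comparison is done on strings, so a label such as "it10" would typically sort before "it2". That does not matter for grouping, since the key only has to be the same for both orderings of a pair.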

And if you also need the pairs in individual columns, then with the addition of tidyr you can do:

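library(dplyr)
library(tidyr)
# As above, df.now is assumed to be a data frame rather than the cbind() matrix
df.now %>%
 group_by(pair = paste(pmax(fit, sit), pmin(fit, sit), sep = "_")) %>%
 summarise_at(vars(starts_with("value")), ~ ifelse(all(is.na(.)), 
                                                   NA,
                                                   first(na.omit(.)))) %>%
 # split the key back into two columns; pmax() came first, so "fit" holds the higher item
 separate(pair, into = c("fit", "hit"), sep = "_", remove = FALSE)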

  pair    fit   hit   value1 value2 value3
  <chr>   <chr> <chr>  <dbl>  <dbl>  <dbl>
1 it2_it1 it2   it1        1      3      2
2 it4_it3 it4   it3        2      3      4
3 it6_it5 it6   it5        5     NA      2
4 it9_it7 it9   it7       NA      4     NA
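
If the pair identifier from step 2 of the question is still wanted as its own column, the same key can be numbered directly. This is only a sketch in base R, assuming the data are held in a data frame (here called `df`, a hypothetical name) rather than the cbind() matrix; `lit`, `hit` and `dup` follow the column names in the question:

```r
value1 <- c(1, NA, 2, NA, 5, NA, NA)
value2 <- c(NA, 3, 3, NA, NA, NA, 4)
value3 <- c(NA, 2, 4, NA, NA, 2, NA)
fit <- c("it1", "it2", "it3", "it4", "it5", "it6", "it7")
sit <- c("it2", "it1", "it4", "it3", "it6", "it5", "it9")
df <- data.frame(value1, value2, value3, fit, sit, stringsAsFactors = FALSE)

# Step 1: lowest and highest item of each pair
df$lit <- pmin(df$fit, df$sit)
df$hit <- pmax(df$fit, df$sit)
key <- paste(df$lit, df$hit, sep = "_")

# Step 2: number the pairs in order of first appearance
df$dup <- match(key, unique(key))

# Pairs that occur only once get NA instead of an id
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
df$dup[!is_dup] <- NA

df$dup
# 1 1 2 2 3 3 NA
```

Note that the ids are positions in unique(key), so they are not guaranteed to be consecutive when an unpaired row sits between duplicated pairs.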

answered Sep 30 '22 15:09

tmfmnk
