
Cleaning the duplicates with a reference from another data frame

I want to get rid of duplicates by using the correct information in another data frame.

The problem is that the original data contains duplicates, some with the right values and some with wrong values. The right values are defined in another data frame, so I want to use that data frame as a reference for those rows.

So the operation I want is conditional on comparing rows across the two data frames. To illustrate, let's say the original data is tree1:

tree1 = data.frame( 
sp = c("oak","pine","apple","birch","oak","pine","apple","maple"), 
code = c(23:26,77,88,99,27))
> tree1
     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5   oak   77
6  pine   88
7 apple   99
8 maple   27

And the reference data is tree2:

tree2 = data.frame( sp = c("oak","pine","apple"),
                    code = 23:25)
> tree2
     sp code
1   oak   23
2  pine   24
3 apple   25

My desired output, with the wrongly coded duplicates removed and the rest of the original data kept, should look like this:

> tree3
     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5 maple   27

I know it looks like an easy conditional operation, but I ended up either deleting some original values or keeping the duplicates with wrong values (the other way around does not work either). Simple base R help would be great.

DSA asked Jan 27 '23 11:01


2 Answers

One option uses base R mapply. Assuming tree1 and tree2 have the same columns in the same order, we can check, column by column, which values of tree1 are present in tree2, and then select only the rows where either all of the values match or none of them do.

vals <- rowSums(mapply(`%in%`, tree1, tree2))
tree1[vals == ncol(tree1) | vals == 0, ]

#     sp code
#1   oak   23
#2  pine   24
#3 apple   25
#4 birch   26
#8 maple   27
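
Note that the `%in%` check above tests each column independently, so a hypothetical row such as ("oak", 24), whose values both occur somewhere in tree2 but not in the same row, would also be kept. A minimal sketch of a row-wise variant follows. It assumes that sp identifies a species and that each species appears only once in tree2. The names ref_code and keep are introduced here for illustration and are not part of the original answer.

```r
tree1 <- data.frame(
  sp = c("oak", "pine", "apple", "birch", "oak", "pine", "apple", "maple"),
  code = c(23:26, 77, 88, 99, 27)
)
tree2 <- data.frame(sp = c("oak", "pine", "apple"), code = 23:25)

# Look up the reference code for each row of tree1
# (NA when the species is not listed in tree2).
ref_code <- tree2$code[match(tree1$sp, tree2$sp)]

# Keep a row if its species is absent from the reference,
# or if its code agrees with the reference code for that species.
keep <- is.na(ref_code) | tree1$code == ref_code

tree3 <- tree1[keep, ]
rownames(tree3) <- NULL  # renumber the remaining rows
tree3
```

With this approach, a species that is listed in tree2 but appears in tree1 only with wrong codes would be dropped entirely. Whether that is desirable depends on the data.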
Ronak Shah answered Jan 29 '23 00:01


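Here is a dplyr option:

library(dplyr)
# rows of tree1 whose species does not appear in the reference
tree2bis <- filter(tree1, !sp %in% tree2$sp)
# keep the rows that match the reference exactly, then append the unreferenced rows
tree1 %>% inner_join(tree2, by = c("sp", "code")) %>% bind_rows(tree2bis)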
# output
     sp code
1   oak   23
2  pine   24
3 apple   25
4 birch   26
5 maple   27
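
Since the question asks for base R, the same join-then-append idea can be sketched without dplyr. This is only an illustration, under the assumption that both data frames share the columns sp and code. The names matched and unlisted are hypothetical. Because merge() sorts by the key columns by default, the final ordering by code is there only to reproduce the layout of the desired output.

```r
tree1 <- data.frame(
  sp = c("oak", "pine", "apple", "birch", "oak", "pine", "apple", "maple"),
  code = c(23:26, 77, 88, 99, 27)
)
tree2 <- data.frame(sp = c("oak", "pine", "apple"), code = 23:25)

# Rows of tree1 that match a reference row on both sp and code
# (the base R counterpart of an inner join).
matched <- merge(tree1, tree2, by = c("sp", "code"))

# Rows of tree1 whose species is not covered by the reference at all.
unlisted <- tree1[!tree1$sp %in% tree2$sp, ]

# Append the two pieces and restore a readable order.
tree3 <- rbind(matched, unlisted)
tree3 <- tree3[order(tree3$code), ]
rownames(tree3) <- NULL
tree3
```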
nghauran answered Jan 29 '23 02:01