in the following dataframe I want to keep rows only once if they have duplicate pairs (1 4 and 4 1 are considered the same pair) of Var1
and Var2
. I thought of sorting Var1
and Var2
within the row and then remove duplicate rows based on both Var1
and Var2
. However, I don't get to my desired result.
This is what my data looks like:
Var1 <- c(1,2,3,4,5,5)
Var2 <- c(4,3,2,1,5,5)
f <- c("blue","green","yellow","red","orange2","grey")
g <- c("blue","green","yellow","red","orange1","grey")
testdata <- data.frame(Var1,Var2,f,g)
I can sort within the rows, however the values of columns f and g should remain untouched, how do I do this?
testdata <- t(apply(testdata, 1, function(x) x[order(x)]))
testdata <- as.data.table(testdata)
Then, I want to remove duplicate rows based on Var1
and Var2
I want to get this as a result:
Var1 Var2 f g
1 4 blue blue
2 3 green green
5 5 orange2 orange1
Thanks for your help!
Remove duplicate values from multiple columns in ExcelRemoving duplicates from multiple columns in Excel is the exact same process as removing duplicate values from a single column. The only difference is the “Remove Duplicates” dialog box—you'll need to confirm the columns you wish to remove duplicates from.
In SQL, some rows contain duplicate entries in multiple columns(>1). For deleting such rows, we need to use the DELETE keyword along with self-joining the table with itself.
By using pandas. DataFrame. drop_duplicates() method you can remove duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.
In case people are interested in solving this using dplyr:
library(dplyr)
testdata %>%
rowwise() %>%
mutate(key = paste(sort(c(Var1, Var2)), collapse="")) %>%
distinct(key, .keep_all=T) %>%
select(-key)
# Source: local data frame [3 x 4]
# Groups: <by row>
#
# # A tibble: 3 × 4
# Var1 Var2 f g
# <dbl> <dbl> <fctr> <fctr>
# 1 1 4 blue blue
# 2 2 3 green green
# 3 5 5 orange2 orange1
Instead of sorting for the whole dataset, sort the 'Var1', 'Var2', and then use duplicated
to remove the duplicate rows
testdata[1:2] <- t( apply(testdata[1:2], 1, sort) )
testdata[!duplicated(testdata[1:2]),]
# Var1 Var2 f g
#1 1 4 blue blue
#2 2 3 green green
#5 5 5 orange2 orange1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With