Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Remove duplicate column pairs, sort rows based on 2 columns [duplicate]

Tags:

r

in the following dataframe I want to keep rows only once if they have duplicate pairs (1 4 and 4 1 are considered the same pair) of Var1 and Var2. I thought of sorting Var1 and Var2 within the row and then remove duplicate rows based on both Var1 and Var2. However, I don't get to my desired result.

This is what my data looks like:

Var1 <- c(1,2,3,4,5,5)
Var2 <- c(4,3,2,1,5,5)
f <- c("blue","green","yellow","red","orange2","grey")
g <- c("blue","green","yellow","red","orange1","grey")
testdata <- data.frame(Var1,Var2,f,g)

I can sort within the rows, however the values of columns f and g should remain untouched, how do I do this?

testdata <- t(apply(testdata, 1, function(x) x[order(x)]))
testdata <- as.data.table(testdata)

Then, I want to remove duplicate rows based on Var1 and Var2

I want to get this as a result:

Var1 Var2 f       g
1    4    blue    blue
2    3    green   green
5    5    orange2 orange1

Thanks for your help!

like image 853
qg7el Avatar asked Mar 20 '15 15:03

qg7el


People also ask

Can you remove duplicates in Excel based on two columns?

Remove duplicate values from multiple columns in ExcelRemoving duplicates from multiple columns in Excel is the exact same process as removing duplicate values from a single column. The only difference is the “Remove Duplicates” dialog box—you'll need to confirm the columns you wish to remove duplicates from.

How do you delete duplicate rows in SQL based on two columns?

In SQL, some rows contain duplicate entries in multiple columns(>1). For deleting such rows, we need to use the DELETE keyword along with self-joining the table with itself.

How do I delete duplicate rows based on two columns in pandas?

By using pandas. DataFrame. drop_duplicates() method you can remove duplicate rows from DataFrame. Using this method you can drop duplicate rows on selected multiple columns or all columns.


2 Answers

In case people are interested in solving this using dplyr:

library(dplyr)
testdata %>% 
   rowwise() %>%
   mutate(key = paste(sort(c(Var1, Var2)), collapse="")) %>%
   distinct(key, .keep_all=T) %>%
   select(-key)

# Source: local data frame [3 x 4]
# Groups: <by row>
# 
# # A tibble: 3 × 4
#    Var1  Var2       f       g
#   <dbl> <dbl>  <fctr>  <fctr>
# 1     1     4    blue    blue
# 2     2     3   green   green
# 3     5     5 orange2 orange1
like image 194
sinQueso Avatar answered Sep 23 '22 17:09

sinQueso


Instead of sorting for the whole dataset, sort the 'Var1', 'Var2', and then use duplicated to remove the duplicate rows

testdata[1:2] <- t( apply(testdata[1:2], 1, sort) )
testdata[!duplicated(testdata[1:2]),]
#   Var1 Var2       f       g
#1    1    4    blue    blue
#2    2    3   green   green
#5    5    5 orange2 orange1
like image 30
akrun Avatar answered Sep 23 '22 17:09

akrun