Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove rows that contain identical pairs in opposite order in 2 columns

Tags:

r

dplyr

In a correlation matrix I would like to get rid of the rows that are basically containing the same information as another row, except instead of "A" and "B" in var1 and var2 column contain "B" and "A" respectively

   var1 var2      value
1   cyl  mpg -0.8521620
2  disp  mpg -0.8475514
3    wt  mpg -0.8676594
4   mpg  cyl -0.8521620
5  disp  cyl  0.9020329
6    hp  cyl  0.8324475
7    vs  cyl -0.8108118
8   mpg disp -0.8475514
9   cyl disp  0.9020329
10   wt disp  0.8879799
11  cyl   hp  0.8324475
12  mpg   wt -0.8676594
13 disp   wt  0.8879799
14  cyl   vs -0.8108118

Here we could drop for instance row 4 with mpg vs cyl since we have cyl vs mpg in row 1 already

I know I could filter for unique values in column value, BUT i don't want to do this as with my enormous data set there is actually a chance of getting identical correlation score with multiple pairs of columns. So it has to be done by name matching col var1 and var2

I have this code so far to filter out data rows that are above a certain correlation value, but are not 1 (variable vs itself)

mtcars %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(value > 0.8 | value < -0.8) %>%
  filter(value != 1)

EDIT

Andre's answer

cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%

is faster, but Rui's answer is more generic and can be applied to other situations other than cor matrix calculations.

Unit: milliseconds
   expr      min       lq     mean   median       uq      max neval cld
   Andre 4.818793 5.113676 5.630160 5.408955 5.704825 22.33730   100  a 
   Rui   5.413692 5.761669 7.531146 6.003656 6.583750 78.02836   100   b
like image 983
Mark Avatar asked Mar 05 '23 05:03

Mark


2 Answers

code:

mtcars %>% 
    as.matrix %>%
    cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%
    as.data.frame %>%
    rownames_to_column(var = 'var1') %>%
    gather(var2, value, -var1) %>%
    filter(value > 0.8 | value < -0.8) %>%
    filter(value != 1)

result:

#  var1 var2      value
#1  cyl  mpg -0.8521620
#2 disp  mpg -0.8475514
#3   wt  mpg -0.8676594
#4 disp  cyl  0.9020329
#5   hp  cyl  0.8324475
#6   vs  cyl -0.8108118
#7   wt disp  0.8879799

tricks used:

  • use an anonymous function READ MORE
  • wrap { around anonymous function to prevent default pipe behavior (pipe to first possible position), it seems to work without this step, but I feel safer doing this. READ MORE
  • remove all duplicate values (?upper.tri) (you can even remove the diagonal in that step, to remove the last code snippet filter(value != 1))

my suggestion:

mtcars %>% 
    as.matrix %>%
    cor %>% {(function(x){x[upper.tri(x, diag = T)]<-NA; x})(.)} %>%
    as.data.frame %>%
    rownames_to_column(var = 'var1') %>%
    gather(var2, value, -var1) %>%
    filter(value > 0.8 | value < -0.8)
like image 135
Andre Elrico Avatar answered May 02 '23 18:05

Andre Elrico


Another way is simply to filter by var1 < var2.

mtcars %>% 
  as.matrix %>%
  cor %>%
  as.data.frame %>%
  rownames_to_column(var = 'var1') %>%
  gather(var2, value, -var1) %>%
  filter(value > 0.8 | value < -0.8) %>%
  filter(value != 1) %>%
  filter(var1 < var2)
#  var1 var2      value
#1  cyl  mpg -0.8521620
#2 disp  mpg -0.8475514
#3  cyl disp  0.9020329
#4  cyl   hp  0.8324475
#5  mpg   wt -0.8676594
#6 disp   wt  0.8879799
#7  cyl   vs -0.8108118
like image 31
Rui Barradas Avatar answered May 02 '23 20:05

Rui Barradas