In a correlation matrix I would like to get rid of the rows that are basically containing the same information as another row, except instead of "A" and "B" in var1 and var2 column contain "B" and "A" respectively
var1 var2 value
1 cyl mpg -0.8521620
2 disp mpg -0.8475514
3 wt mpg -0.8676594
4 mpg cyl -0.8521620
5 disp cyl 0.9020329
6 hp cyl 0.8324475
7 vs cyl -0.8108118
8 mpg disp -0.8475514
9 cyl disp 0.9020329
10 wt disp 0.8879799
11 cyl hp 0.8324475
12 mpg wt -0.8676594
13 disp wt 0.8879799
14 cyl vs -0.8108118
Here we could drop for instance row 4 with mpg vs cyl since we have cyl vs mpg in row 1 already
I know I could filter for unique values in column value, BUT i don't want to do this as with my enormous data set there is actually a chance of getting identical correlation score with multiple pairs of columns. So it has to be done by name matching col var1
and var2
I have this code so far to filter out data rows that are above a certain correlation value, but are not 1 (variable vs itself)
mtcars %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(value > 0.8 | value < -0.8) %>%
filter(value != 1)
EDIT
Andre's answer
cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%
is faster, but Rui's answer is more generic and can be applied to other situations other than cor matrix calculations.
Unit: milliseconds
expr min lq mean median uq max neval cld
Andre 4.818793 5.113676 5.630160 5.408955 5.704825 22.33730 100 a
Rui 5.413692 5.761669 7.531146 6.003656 6.583750 78.02836 100 b
code:
mtcars %>%
as.matrix %>%
cor %>% {(function(x){x[upper.tri(x)]<-NA; x})(.)} %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(value > 0.8 | value < -0.8) %>%
filter(value != 1)
result:
# var1 var2 value
#1 cyl mpg -0.8521620
#2 disp mpg -0.8475514
#3 wt mpg -0.8676594
#4 disp cyl 0.9020329
#5 hp cyl 0.8324475
#6 vs cyl -0.8108118
#7 wt disp 0.8879799
tricks used:
{
around anonymous function to prevent default pipe behavior (pipe to first possible position), it seems to work without this step, but I feel safer doing this. READ MORE
?upper.tri
) (you can even remove the diagonal in that step, to remove the last code snippet filter(value != 1)
)my suggestion:
mtcars %>%
as.matrix %>%
cor %>% {(function(x){x[upper.tri(x, diag = T)]<-NA; x})(.)} %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(value > 0.8 | value < -0.8)
Another way is simply to filter
by var1 < var2
.
mtcars %>%
as.matrix %>%
cor %>%
as.data.frame %>%
rownames_to_column(var = 'var1') %>%
gather(var2, value, -var1) %>%
filter(value > 0.8 | value < -0.8) %>%
filter(value != 1) %>%
filter(var1 < var2)
# var1 var2 value
#1 cyl mpg -0.8521620
#2 disp mpg -0.8475514
#3 cyl disp 0.9020329
#4 cyl hp 0.8324475
#5 mpg wt -0.8676594
#6 disp wt 0.8879799
#7 cyl vs -0.8108118
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With