 

Efficient way to get highly correlated pairs from large data set in Python or R

I have a large data set (say, 10,000 variables with about 1,000 elements each) that we can think of as a 2D list, something like:

[[variable_1],
 [variable_2],
 ............
 [variable_n]
]

I want to extract highly correlated variable pairs from that data, where "highly correlated" is a threshold parameter I can choose.

I don't need all pairs to be extracted, and I don't necessarily want the most correlated pairs; as long as there is an efficient method that gets me highly correlated pairs, I am happy.

Also, it would be nice if a variable did not show up in more than one pair, although this is not crucial.

Of course, there is a brute-force way to find such pairs, but it is too slow for me.

I've googled around a bit and found some theoretical work on this issue, but I wasn't able to find a package that does what I am looking for. I mostly work in Python, so a package in Python would be most helpful, but if there is a package in R that does what I am looking for, that would be great too.

Does anyone know of a package that does the above in Python or R? Or any other ideas?

Thank you in advance.

Asked Jan 17 '23 by Akavall


1 Answer

You didn't tell us how fast "fast" needs to be, so here's a naive solution.

Simply compute the correlation matrix and then use which() to get the indices of the pairs you're after:

x <- matrix(rnorm(10000 * 1000), ncol = 10000)   # 1,000 observations of 10,000 variables
corm <- cor(x)                                   # full 10,000 x 10,000 correlation matrix
out <- which(abs(corm) > 0.80, arr.ind = TRUE)   # (row, col) index pairs above the threshold

You can then use subsetting to get rid of the diagonal and redundant pairs:

out[out[, 1] > out[, 2], ]   # keep row > col: drops the diagonal and one of each symmetric pair
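
If you'd rather see variable names and correlation values than raw indices, here is a minimal sketch of one way to wrap the result up (this step is an addition, not part of the original answer; it assumes the columns of x are named, and the simulated x above isn't, so names are added first):

colnames(x) <- paste0("V", seq_len(ncol(x)))       # simulated data has no names; real data usually will
pairs <- out[out[, 1] > out[, 2], , drop = FALSE]
data.frame(var1 = colnames(x)[pairs[, 1]],         # first variable of each pair
           var2 = colnames(x)[pairs[, 2]],         # second variable of each pair
           cor  = corm[pairs])                     # matrix indexing pulls out each correlation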

Calculating the correlation matrix takes about 75 seconds on my machine, the which() part takes about 3 seconds, and subsetting out the redundancy takes about 1.2 seconds. Is that too slow?
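
You also mentioned it would be nice if no variable showed up in more than one pair. That isn't handled above; one simple option is a greedy pass over the pairs, strongest absolute correlation first. This is a sketch of that idea, not a tuned implementation:

pairs <- out[out[, 1] > out[, 2], , drop = FALSE]
pairs <- pairs[order(-abs(corm[pairs])), , drop = FALSE]   # strongest |cor| first
used <- logical(ncol(x))     # has this variable been placed in a pair yet?
keep <- logical(nrow(pairs))
for (i in seq_len(nrow(pairs))) {
  a <- pairs[i, 1]
  b <- pairs[i, 2]
  if (!used[a] && !used[b]) {   # keep the pair only if both variables are still free
    keep[i] <- TRUE
    used[c(a, b)] <- TRUE
  }
}
pairs[keep, , drop = FALSE]    # each variable now appears in at most one pair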

Answered Jan 18 '23 by Chase