I have a large data set (say 10,000 variables with about 1,000 elements each). We can think of it as a 2D list, something like:
[[variable_1],
[variable_2],
............
[variable_n]
]
I want to extract highly correlated variable pairs from that data. I want "highly correlated" to be a parameter that I can choose.
I don't need all pairs to be extracted, and I don't necessarily want the most correlated pairs. As long as there is an efficient method that gets me highly correlated pairs I am happy.
Also, it would be nice if a variable did not show up in more than one pair, although this is not crucial.
Of course, there is a brute-force way of finding such pairs, but it is too slow for me.
I've googled around for a bit and found some theoretical work on this issue, but I wasn't able to find a package that does what I am looking for. I mostly work in Python, so a Python package would be most helpful, but an R package that does this would be great too.
Does anyone know of a package that does the above in Python or R? Or any other ideas?
Thank you in advance.
You didn't tell us how fast you need fast to be, so here's a naive solution.
Simply compute the correlation matrix and then use which() to get the indices of the pairs you're after:
x <- matrix(rnorm(10000*1000), ncol = 10000)
corm <- cor(x)
out <- which(abs(corm) > 0.80, arr.ind=TRUE)
You can then use subsetting to get rid of the diagonal and redundant pairs:
out[out[,1] > out[,2], ]
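For reference, the steps above can be wrapped in a small function so that the cutoff is a parameter, as the question asks. This is only a sketch: the name high_cor_pairs and its arguments are made up for illustration, and it uses upper.tri() to drop the diagonal and the mirrored pairs in one go instead of subsetting afterwards. The toy data is small and has one deliberately correlated pair of columns.

```r
# Hypothetical helper (not from any package): return the variable pairs whose
# absolute correlation exceeds `threshold`. Columns of `x` are variables.
high_cor_pairs <- function(x, threshold = 0.8) {
  corm <- cor(x)
  # Blank out the diagonal and lower triangle so each pair appears only once.
  corm[!upper.tri(corm)] <- NA
  # which() skips NA entries, so only upper-triangle hits are returned.
  idx <- which(abs(corm) > threshold, arr.ind = TRUE)
  data.frame(var1 = idx[, "row"], var2 = idx[, "col"], r = corm[idx])
}

# Toy data: 5 variables, 200 observations, column 4 built to track column 1.
set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)
x[, 4] <- x[, 1] + rnorm(200, sd = 0.1)

res <- high_cor_pairs(x, threshold = 0.8)
print(res)
```

The result has one row per pair, with var1 < var2, so it can be sorted or filtered on r like any other data frame.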
Calculating the correlation matrix takes about 75 seconds on my machine, the which() part takes about 3 seconds, and subsetting out the redundancy takes about 1.2 seconds. Is that too slow?
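The question also mentions that, ideally, no variable should show up in more than one pair. One simple way to get that, sketched here as an assumption rather than a recommended method, is a greedy pass over the index pairs that keeps a pair only if neither of its variables has been used yet. The helper name disjoint_pairs is hypothetical, and the result depends on the order of the rows, so sort by correlation first if you want stronger pairs to win.

```r
# Hypothetical helper: greedily keep pairs so that each variable is used at
# most once. `pairs` is a two-column matrix of variable indices, one pair per row.
disjoint_pairs <- function(pairs, n_vars = max(pairs)) {
  used <- logical(n_vars)        # used[i] is TRUE once variable i is in a kept pair
  keep <- logical(nrow(pairs))   # which rows of `pairs` to keep
  for (k in seq_len(nrow(pairs))) {
    i <- pairs[k, 1]
    j <- pairs[k, 2]
    if (!used[i] && !used[j]) {
      used[c(i, j)] <- TRUE
      keep[k] <- TRUE
    }
  }
  pairs[keep, , drop = FALSE]
}

# Made-up candidate pairs: variable 1 appears twice, variable 2 appears twice.
candidate <- rbind(c(2, 1), c(3, 1), c(4, 3), c(5, 2))
result <- disjoint_pairs(candidate)
print(result)
```

The loop is linear in the number of candidate pairs, so it adds little on top of the correlation matrix itself.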