
Approximation of covariance for differently sized arrays

Are there any common tools in NumPy/SciPy for computing a correlation measure that works even when the input variables are differently sized? In the standard formulation of covariance and correlation, one is required to have the same number of observations for each different variable under test. Typically, you must pass a matrix where each row is a different variable and each column represents a distinct observation.

In my case, I have 9 different variables, but for each variable the number of observations is not constant. Some variables have more observations than others. I know that there are fields like sensor fusion which study problems like this, so what standard tools are out there for computing relational statistics on data series of differing lengths (preferably in Python)?

ely asked Jan 09 '12

People also ask

Does the size of covariance matter?

The magnitude of the covariance is not meaningful to interpret. However, the standardized version of the covariance, the correlation coefficient, indicates by its magnitude the strength of the relationship. A covariance matrix measures the covariance between many variables.

Does covariance depend on scale?

Change in scale: covariance is affected by a change in scale; if all the values of one variable are multiplied by a constant, and all the values of another variable are multiplied by the same or a different constant, then the covariance changes. Correlation, by contrast, is not affected by a change in scale.

Does covariance change if units change?

Even a change in the units of measurement can change the covariance. Thus, covariance is only useful to find the direction of the relationship between two variables and not the magnitude.


3 Answers

I would examine this page:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.cov.html

UPDATE:

Suppose each row of your data matrix corresponds to a particular random variable, and the entries in that row are observations. Then what you have is a simple missing-data problem, as long as there is a correspondence between the observations. That is to say, if one of your rows has only 10 entries, do those 10 entries (i.e., trials) correspond to 10 samples of the random variable in the first row?

For example, suppose you have two temperature sensors that take samples at the same times, but one is faulty and sometimes misses a sample. You should treat the trials where the faulty sensor failed to produce a reading as "missing data." In NumPy this is as simple as creating two vectors of the same length, putting zeros (or any placeholder value, really) at the positions of the shorter series that correspond to the missing trials, and building a mask matrix that indicates where the missing values sit in your data matrix.

Supplying such a matrix to the function linked to above should allow you to perform exactly the computation you want.
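As a sketch of that approach (the sensor readings below are made up purely for illustration), build a masked array and hand it straight to numpy.ma.cov:

```python
import numpy as np

# Two sensors sampled at the same six trials; the faulty sensor
# missed trials 3 and 5, so those entries are masked.  The filled-in
# zeros are never used -- the mask hides them.
good = np.ma.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
faulty = np.ma.array([2.1, 3.9, 0.0, 8.2, 0.0, 12.1],
                     mask=[0, 0, 1, 0, 1, 0])

data = np.ma.vstack([good, faulty])  # rows = variables, columns = trials
cov = np.ma.cov(data)                # masked entries are ignored pairwise
print(cov)
```

The placeholder value really is arbitrary: only the mask decides which trials enter each sum.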

ddodev answered Sep 28 '22


"The issue is that each variable corresponds to the response on a survey, and not every survey taker answered every question. Thus, I want some measure of how an answer to question 2, say, affects likelihood of answers to question 8, for example."

This is the missing data problem. I think what's confusing people is that you keep referring to your samples as having different lengths. I think you might be visualizing them like this:

sample 1:

question number: [1,2,3,4,5]
response       : [1,0,1,1,0]

sample 2:

question number: [2,4,5]
response       : [1,1,0]

when sample 2 should be more like this:

question number: [  1,2,  3,4,5]
response       : [NaN,1,NaN,1,0]

It's the question number, not the number of questions answered, that matters. Without question-to-question correspondence it is impossible to calculate anything like a covariance matrix.
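Building that aligned form is mechanical. A minimal sketch, using the hypothetical sample 2 from above in a (question number, response) sparse form:

```python
import numpy as np

n_questions = 5

# Sparse form of sample 2: question number -> response
answered = {2: 1.0, 4: 1.0, 5: 0.0}

# Dense, aligned form: NaN marks the questions that were skipped
aligned = np.full(n_questions, np.nan)
for question, response in answered.items():
    aligned[question - 1] = response  # question numbers are 1-based
```

Every respondent then yields a vector of the same length, with NaN wherever an answer is missing.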

Anyway, the numpy.ma.cov function that ddodev mentioned calculates the covariance by taking advantage of the fact that each element being summed depends on only two values.

So it calculates the ones it can. Then, when it comes to the step of dividing by n, it divides by the number of values that were actually calculated for that particular covariance-matrix element, instead of by the total number of samples.
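To see that pairwise bookkeeping in action, mask the NaNs and let numpy.ma.cov handle the rest (the survey matrix below is invented for illustration):

```python
import numpy as np

# Hypothetical survey: rows = respondents, columns = questions 1-5,
# NaN = question skipped by that respondent.
responses = np.array([
    [1.0,    0.0, 1.0,    1.0,    0.0],
    [np.nan, 1.0, np.nan, 1.0,    0.0],
    [0.0,    1.0, 1.0,    np.nan, 1.0],
    [1.0,    1.0, 0.0,    0.0,    1.0],
])

masked = np.ma.masked_invalid(responses)

# rowvar=False: each *column* (question) is a variable.  Each entry of
# the result is normalized by its own count of valid response pairs,
# not by the total number of respondents.
cov = np.ma.cov(masked, rowvar=False)
print(cov)
```

Note that every pair of questions here shares at least two fully answered rows; a pair answered jointly by fewer than two respondents would leave that covariance entry undefined.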

mdaoust answered Sep 27 '22


From a purely mathematical point of view, I believe they have to be the same length: it is not strictly a covariance anymore if the vectors aren't the same size. To make them the same you can apply concepts from the missing-data problem; whatever tool you use will just fill in points in some smart way to make the vectors of equal length.

Matt answered Sep 27 '22