
Approximation of covariance for differently sized arrays

Are there any common tools in NumPy/SciPy for computing a correlation measure that works even when the input variables are differently sized? In the standard formulation of covariance and correlation, one is required to have the same number of observations for each different variable under test. Typically, you must pass a matrix where each row is a different variable and each column represents a distinct observation.

In my case, I have 9 different variables, but for each variable the number of observations is not constant. Some variables have more observations than others. I know that there are fields like sensor fusion which study problems like this, so what standard tools are out there for computing relational statistics on data series of differing lengths (preferably in Python)?

ely asked Jan 09 '12

People also ask

Does the size of covariance matter?

The magnitude of the covariance is not meaningful to interpret. However, the standardized version of the covariance, the correlation coefficient, indicates by its magnitude the strength of the relationship. A covariance matrix measures the covariance between many variables.

Does covariance depend on scale?

Change in scale: covariance is affected by a change in scale; if all the values of one variable are multiplied by a constant, and all the values of another variable are multiplied by the same or a different constant, then the covariance changes. Correlation, by contrast, is not affected by a change in scale.

Does covariance change if units change?

Even a change in the units of measurement can change the covariance. Thus, covariance is only useful to find the direction of the relationship between two variables and not the magnitude.


3 Answers

I would examine this page:

http://docs.scipy.org/doc/numpy/reference/generated/numpy.ma.cov.html

UPDATE:

Suppose each row of your data matrix corresponds to a particular random variable, and the entries in that row are observations. Then what you have is a simple missing-data problem, as long as there is a correspondence between the observations. That is to say, if one of your rows has only 10 entries, do those 10 entries (i.e., trials) correspond to 10 samples of the random variable in the first row?

For example, suppose you have two temperature sensors that take samples at the same times, but one is faulty and sometimes misses a sample. You should treat the trials where the faulty sensor failed to produce a reading as "missing data." In NumPy this is as simple as creating two vectors of the same length, putting zeros (or any placeholder value, really) at the positions of the shorter series that correspond to the missing trials, and building a mask matrix that indicates where the missing values sit in your data matrix.

Supplying such a matrix to the function linked to above should allow you to perform exactly the computation you want.
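As a sketch of that approach (the sensor readings below are made up purely for illustration), build a masked array and hand it straight to numpy.ma.cov:

```python
import numpy as np

# Two sensors sampled at the same six trials; the faulty sensor
# missed trials 3 and 5, so those entries are masked.  The filled-in
# zeros are never used -- the mask hides them.
good = np.ma.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
faulty = np.ma.array([2.1, 3.9, 0.0, 8.2, 0.0, 12.1],
                     mask=[0, 0, 1, 0, 1, 0])

data = np.ma.vstack([good, faulty])  # rows = variables, columns = trials
cov = np.ma.cov(data)                # masked entries are ignored pairwise
print(cov)
```

The placeholder value really is arbitrary: only the mask decides which trials enter each sum.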

ddodev answered Sep 28 '22


"The issue is that each variable corresponds to the response on a survey, and not every survey taker answered every question. Thus, I want some measure of how an answer to question 2, say, affects likelihood of answers to question 8, for example."

This is the missing data problem. I think what's confusing people is that you keep referring to your samples as having different lengths. I think you might be visualizing them like this:

sample 1:

question number: [1,2,3,4,5]
response       : [1,0,1,1,0]

sample 2:

question number: [2,4,5]
response       : [1,1,0]

when sample 2 should be more like this:

question number: [  1,2,  3,4,5]
response       : [NaN,1,NaN,1,0]

It's the question number, not the number of questions answered, that matters. Without question-to-question correspondence it is impossible to calculate anything like a covariance matrix.
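Building that aligned form is mechanical. A minimal sketch, using the hypothetical sample 2 from above in a (question number, response) sparse form:

```python
import numpy as np

n_questions = 5

# Sparse form of sample 2: question number -> response
answered = {2: 1.0, 4: 1.0, 5: 0.0}

# Dense, aligned form: NaN marks the questions that were skipped
aligned = np.full(n_questions, np.nan)
for question, response in answered.items():
    aligned[question - 1] = response  # question numbers are 1-based
```

Every respondent then yields a vector of the same length, with NaN wherever an answer is missing.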

Anyway, the numpy.ma.cov function that ddodev mentioned calculates the covariance by taking advantage of the fact that each element being summed depends on only two values.

So it calculates the ones it can. Then, when it comes to the step of dividing by n, it divides by the number of values that were actually calculated for that particular covariance-matrix element, instead of by the total number of samples.
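To see that pairwise bookkeeping in action, mask the NaNs and let numpy.ma.cov handle the rest (the survey matrix below is invented for illustration):

```python
import numpy as np

# Hypothetical survey: rows = respondents, columns = questions 1-5,
# NaN = question skipped by that respondent.
responses = np.array([
    [1.0,    0.0, 1.0,    1.0,    0.0],
    [np.nan, 1.0, np.nan, 1.0,    0.0],
    [0.0,    1.0, 1.0,    np.nan, 1.0],
    [1.0,    1.0, 0.0,    0.0,    1.0],
])

masked = np.ma.masked_invalid(responses)

# rowvar=False: each *column* (question) is a variable.  Each entry of
# the result is normalized by its own count of valid response pairs,
# not by the total number of respondents.
cov = np.ma.cov(masked, rowvar=False)
print(cov)
```

Note that every pair of questions here shares at least two fully answered rows; a pair answered jointly by fewer than two respondents would leave that covariance entry undefined.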

mdaoust answered Sep 27 '22


From a purely mathematical point of view, I believe they have to be the same length: it is not strictly a covariance anymore if the vectors aren't the same size. To make them the same you can apply concepts from the missing-data problem; whatever tool you use will just fill in points in some smart way to make the vectors of equal length.

Matt answered Sep 27 '22