 

In Python, how can I calculate correlation and statistical significance between two arrays of data?

I have data sets that each consist of two arrays of equal length (or, equivalently, one array of two-item entries). I would like to calculate the correlation between the two arrays and its statistical significance; the data may be tightly correlated, or may have no statistically significant correlation.

I am programming in Python and have scipy and numpy installed. I searched and found the question "Calculating Pearson correlation and significance in Python", but that seems to want the data to be manipulated so it falls into a specified range.

What is the proper way to, I assume, ask scipy or numpy to give me the correlation and statistical significance of two arrays?

Christos Hayward asked Jun 20 '12 14:06



People also ask

How do you know if a correlation is statistically significant in Python?

To determine if the correlation coefficient between two variables is statistically significant, you can perform a correlation test in Python using the pearsonr function from the SciPy library. This function returns the correlation coefficient between two variables along with the two-tailed p-value.

How do you find the significant correlation between two variables?

The correlation coefficient is determined by dividing the covariance by the product of the two variables' standard deviations. Standard deviation is a measure of the dispersion of data from its average. Covariance is a measure of how two variables change together.
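As a minimal sketch of that definition, the snippet below divides the sample covariance by the product of the sample standard deviations and compares the result with numpy.corrcoef. The data values are made up for illustration, and the choice of ddof=1 (sample statistics) is an assumption; any consistent choice gives the same ratio.

```python
import numpy as np

# Illustrative data, invented for this sketch.
x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Sample covariance between x and y (off-diagonal entry of the 2x2 matrix).
cov_xy = np.cov(x, y, ddof=1)[0, 1]

# Divide by the product of the two sample standard deviations.
r_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# The same coefficient as computed by NumPy directly.
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

For this data both values come out at approximately 0.8.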

How do you find the correlation coefficient between two variables in Python?

The Pearson correlation coefficient can be computed in Python using the corrcoef() function from NumPy. The input is typically a 2-D array in which, by default (rowvar=True), each row represents a variable and each column represents a single observation of all the variables. Pass rowvar=False if the variables are stored in columns instead.


2 Answers

If you want to calculate the Pearson correlation coefficient, then scipy.stats.pearsonr is the way to go, although the significance is only meaningful for larger data sets. This function does not require the data to be manipulated to fall into a specified range. It is the resulting correlation value that falls in the interval [-1, 1]; perhaps that was the source of the confusion?
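A minimal sketch of calling scipy.stats.pearsonr on two equally long arrays is shown here. The data values are invented for illustration; the raw arrays are passed in as they are, with no rescaling.

```python
import numpy as np
from scipy import stats

# Illustrative data, invented for this sketch: y is roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

# pearsonr returns the correlation coefficient and the two-sided p-value.
r, p_value = stats.pearsonr(x, y)

print(f"r = {r:.4f}, p-value = {p_value:.3g}")
```

A small p-value suggests the observed correlation would be unlikely if the two variables were uncorrelated; with only a handful of points, as in this toy example, it should be read with caution.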

If the significance is not terribly important, you can use numpy.corrcoef(), which returns the correlation coefficients only, with no p-value.
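As a rough sketch, numpy.corrcoef can be given the two 1-D arrays directly. It returns a 2x2 correlation matrix, so the coefficient of interest is an off-diagonal entry. The data below is made up for illustration.

```python
import numpy as np

# Illustrative data, invented for this sketch.
x = np.array([0.5, 1.7, 2.9, 4.1, 5.0, 6.3])
y = np.array([1.2, 0.8, 3.5, 3.1, 5.9, 5.2])

# Each input array is treated as one variable; the result is a symmetric
# 2x2 matrix with ones on the diagonal.
corr_matrix = np.corrcoef(x, y)

# The correlation between x and y is the off-diagonal entry.
r = corr_matrix[0, 1]

print(corr_matrix)
print(r)
```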

The Mahalanobis distance does take into account the correlation between two arrays, but it provides a distance measure, not a correlation. (With a fixed positive-definite covariance matrix it behaves as a proper distance function, and it can be used to great advantage in certain contexts, but it tells you how far apart two points are, not how strongly two variables are related.)
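To illustrate the difference, here is a small sketch of scipy.spatial.distance.mahalanobis. It measures the distance between two points given the inverse covariance matrix of some reference data. The reference sample, the random seed, and the choice of points are all assumptions made for this example.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical reference sample: 200 observations of 2 correlated variables.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = 0.8 * a + 0.2 * rng.normal(size=200)
sample = np.column_stack([a, b])

# mahalanobis expects the inverse of the covariance matrix as its third argument.
cov = np.cov(sample, rowvar=False)
inv_cov = np.linalg.inv(cov)

# Distance from the first observation to the sample mean.
u = sample[0]
v = sample.mean(axis=0)
d = mahalanobis(u, v, inv_cov)

print(d)  # a non-negative distance, not a correlation coefficient
```

The result describes how far one point lies from another in units scaled by the covariance; it does not come with a p-value and does not say whether the two variables are correlated.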

cjohnson318 answered Oct 13 '22 00:10



You can use the Mahalanobis distance between these two arrays, which takes into account the correlation between them.

The function is in the scipy package: scipy.spatial.distance.mahalanobis

There's a nice example here

Oriol Nieto answered Oct 12 '22 23:10
