Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL Math - Is it possible to calculate a correlation in a query?

Tags:

math

mysql

In a MySQL (5.1) database table there is data that represents:

  • how long a user takes to perform a task and
  • how many items the user handled during the task.

Would MySQL support correlating the data or do I need to use PHP/C# to calcuate?

Where would I find a good formula to calculate correlation (it's been a long time since I last did this)?

like image 914
John M Avatar asked Mar 16 '10 19:03

John M


People also ask

Can you calculate correlation in SQL?

Correlation in SQLCalculating the correlation coefficient in SQL is pretty straightforward. The corr(Y, X) function returns the correlation coefficient between a set of variables.

How do you find correlation between two data?

The correlation coefficient is determined by dividing the covariance by the product of the two variables' standard deviations. Standard deviation is a measure of the dispersion of data from its average. Covariance is a measure of how two variables change together.

What is correlation variable in SQL?

The Correlation function calculates the correlation coefficient of two pairs of values by first evaluating the specified set against the first numeric expression to obtain the values for the y-axis.


2 Answers

Single-Pass Solution

There are two flavors of the Pearson correlation coefficient, one for a Sample and one for an entire Population. These are single-pass and, I believe, correct formulas for both:

-- Methods for calculating the two Pearson correlation coefficients
SELECT  
        -- For Population
        (avg(x * y) - avg(x) * avg(y)) / 
        (sqrt(avg(x * x) - avg(x) * avg(x)) * sqrt(avg(y * y) - avg(y) * avg(y))) 
        AS correlation_coefficient_population,
        -- For Sample
        (count(*) * sum(x * y) - sum(x) * sum(y)) / 
        (sqrt(count(*) * sum(x * x) - sum(x) * sum(x)) * sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) 
        AS correlation_coefficient_sample
    FROM your_table;

I developed and tested this as T-SQL. The code that generated the test data didn't translate to MySQL but the formulas should. Make sure your x and y are decimals values; integer math can significantly impact these calcs.

like image 125
jwolf Avatar answered Sep 27 '22 18:09

jwolf


Here's a rough implementation of the sample correlation coefficient as described in:

Wikipedia - Correlation and Dependence

create table sample( x float not null, y float not null );
insert into sample values (1, 10), (2, 4), (3, 5), (6,17);

select @ax := avg(x), 
       @ay := avg(y), 
       @div := (stddev_samp(x) * stddev_samp(y))
from sample;

select sum( ( x - @ax ) * (y - @ay) ) / ((count(x) -1) * @div) from sample;
+---------------------------------------------------------+
| sum( ( x - @ax ) * (y - @ay) ) / ((count(x) -1) * @div) |
+---------------------------------------------------------+
|                                       0.700885077729073 |
+---------------------------------------------------------+
like image 44
Martin Avatar answered Sep 27 '22 19:09

Martin