Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How do I run a paired t-test in BigQuery. Are there any UDFs available?

I am attempting to run a paired t-test in bigquery using the JavaScript statistical.js library and I am getting inconsistent numbers.

My code is as follows;

 CREATE TEMPORARY FUNCTION ttest(a ARRAY<FLOAT64>, b ARRAY<FLOAT64>)
 RETURNS FLOAT64
 LANGUAGE js AS
"""
   return statistical.methods.tTestTwoSample(a, b); 
   //statistical.methods.tTestTwoSample(sample1: Array, sample2: Array) 
"""
OPTIONS (
library="gs://my_bucket/statistical.js"
);


SELECT 
 ttest([9.96,   3.76,   1.17,   8.66,   5.25,   7.61,   5.8,    1.84,   7.06, 9.4,  2.99,   9.3,    9.01,   4.24,   3.52,   9.6,    7.59,   6.99,   9.62,   2.18] , 
    [3.96,  5.76,   7.17,   7.66,   9.25,   3.61,   4.8,    8.84,   6.06,   4.4,    1.99,   8.3,    9.01,   1.24,   5.52,   8.6,    5.59,   1.99,   7.62,   3.18]  ) ttest 

With gs://my_bucket/statistical.js as the location of where the JS package is saved in google storage.

The package is available here https://github.com/adrien2p/statistical-js

The two sample t-test statistical.methods.tTestTwoSample(sample1: Array, sample2: Array) should return a p-value. I was unable to find out if the test was paired or otherwise. When I run the code I get 4.099, which as far as I know is not a p-value. I have ran the test in Excel to validate the result using =T.TEST() and I get a p-value of 0.4871 on a paired two sided test. I have also attempted other t-tests other than paired and attempted treating the 4.099 result as a T-Statistic without any luck of the numbers matching.

So my question is, how can I run a paired t-test in BigQuery with a resulting p-value? Should I be using other packages?

Thanks in advance for any help.

like image 901
Mike G Avatar asked Apr 20 '18 05:04

Mike G


1 Answers

Without much work I decided to calculate the t-statistic manually and use the jStat library (Available at github.com/jstat/jstat) to convert the statistic to a p-value.

My code is as follows;

CREATE TEMPORARY FUNCTION tscore_to_p(a FLOAT64, b FLOAT64, c FLOAT64)
 RETURNS FLOAT64
 LANGUAGE js AS
"""
  return jStat.ttest(a,b,c); //jStat.ttest( tscore, n, sides)
"""
OPTIONS (
 library="gs://my_bucket/jstat.min.js"
);


WITH data AS ( 
 SELECT * FROM 
 (SELECT 9.96 AS A, 3.96 AS B) UNION ALL
 (SELECT 3.76 AS A, 5.76 AS B) UNION ALL
 (SELECT 1.17 AS A, 7.17 AS B) UNION ALL
 (SELECT 8.66 AS A, 7.66 AS B) UNION ALL
 (SELECT 5.25 AS A, 9.25 AS B) UNION ALL
 (SELECT 7.61 AS A, 3.61 AS B) UNION ALL
 (SELECT 5.80 AS A, 4.80 AS B) UNION ALL
 (SELECT 1.84 AS A, 8.84 AS B) UNION ALL
 (SELECT 7.06 AS A, 6.06 AS B) UNION ALL
 (SELECT 9.40 AS A, 4.40 AS B) UNION ALL
 (SELECT 2.99 AS A, 1.99 AS B) UNION ALL
 (SELECT 9.30 AS A, 8.30 AS B) UNION ALL
 (SELECT 9.01 AS A, 9.01 AS B) UNION ALL
 (SELECT 4.24 AS A, 1.24 AS B) UNION ALL
 (SELECT 3.52 AS A, 5.52 AS B) UNION ALL
 (SELECT 9.60 AS A, 8.60 AS B) UNION ALL
 (SELECT 7.59 AS A, 5.59 AS B) UNION ALL
 (SELECT 6.99 AS A, 1.99 AS B) UNION ALL
 (SELECT 9.62 AS A, 7.62 AS B) UNION ALL
 (SELECT 2.18 AS A, 3.18 AS B)
 ) 


SELECT 
   COUNT(*) n
 , COUNT(*)-1 dof
 , AVG(difference) mean
 , STDDEV_SAMP(difference) SD 
 , STDDEV_SAMP(difference)/SQRT(COUNT(*)) SE
 , AVG(difference)/ (STDDEV_SAMP(difference)/SQRT(COUNT(*))) t 
 , tscore_to_p((AVG(difference)/ (STDDEV_SAMP(difference)/SQRT(COUNT(*)))) COUNT(*), 2) p_value 

FROM 
  (SELECT *, (A-B) difference FROM data)

This results in a p-value of 0.4871, identical to that of a paired t-test run in Excel.

Thank you to any who have considered the issue.

like image 183
Mike G Avatar answered Sep 29 '22 01:09

Mike G