
How to use bigquery correlation based on many columns?

Given a dataset of 100k rows and 100 columns, how is it possible to use BigQuery's CORR() to find the correlation between the rows?

The schema is:

id:integer, feature1:float, feature2:float, ..., feature100:float

Edit: This is not a rolling-window time-series correlation problem. Each row is an observation of 100 features, and I'd like to use BigQuery to find the top N most similar observations for each row.

asked Aug 31 '14 04:08 by ali


People also ask

Does scaling matter for correlation?

Since the formula for calculating the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value. For this reason, the correlation coefficient is often more useful than a graphical depiction in determining the strength of the association between two variables.

How many columns can BigQuery have?

A table, query result, or view definition can have up to 10,000 columns. With on-demand pricing, your project can have up to 2,000 concurrent slots. BigQuery slots are shared among all queries in a single project.

How do you Unnest in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.

How do you pivot in BigQuery?

Here is a dynamic pivot procedure in standard SQL on BigQuery. It does not aggregate yet; you first need to provide a table with already per-KPI aggregated values (if needed), but it automatically creates a table and generates all the pivoted columns.


1 Answer

You want to find the correlation between each column and the other columns?

That would be something like this:

SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]
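Writing those 100·99/2 = 4,950 CORR() terms by hand is impractical, so one way to automate it is to generate the SELECT list with a short script. A minimal sketch in Python, assuming the column names follow the feature1..feature100 schema from the question:

```python
from itertools import combinations

# Column names per the question's schema: feature1 ... feature100
cols = [f"feature{i}" for i in range(1, 101)]

# One CORR() term per unordered pair of distinct columns
terms = ", ".join(f"CORR({a}, {b})" for a, b in combinations(cols, 2))
query = f"SELECT {terms} FROM [mytable]"

print(query.count("CORR("))  # 4950 pairs
```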

That might take a long time to write (unless you automate it). As an alternative, consider a different schema where everything lives in 3 columns. The transformation would run like this:

SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])

With this schema you can run all the combined column correlations with a simpler query:

SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname

(That's what I did on the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)

Note that the first query might be more complex than this last one, but I suspect it will take less time to run, as no shuffling is required.

Since this question asks about rows, the initial transformation would be similar, but slightly different:

SELECT column, value, rowid FROM
  (SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
  (SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
  (SELECT 'c3' column, c3 AS value, rowid FROM [mytable]) 

Then the correlation between rows would be computed as in:

SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
answered Oct 01 '22 15:10 by Felipe Hoffa