Given a dataset of 100k rows and 100 columns, how can I use BigQuery's CORR() to find the correlation between rows?
The schema is:
id:integer, feature1:float, feature2:float, ..., feature100:float
Edit: This is not a rolling-window time-series correlation problem. Each row is an observation of 100 features, and I'd like to use BigQuery to find the top N most similar observations for each row.
You want to find the correlation between each column and the other columns?
That would be something like this:
SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]
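Writing all 4,950 column pairs by hand isn't practical, but generating the query text is easy to automate. A quick Python sketch (assuming columns named col1 … col100; adjust to the real schema):

```python
from itertools import combinations

# hypothetical column names matching the schema in the question
cols = ["col%d" % i for i in range(1, 101)]

# one CORR() expression per unordered column pair: C(100, 2) = 4950 terms
exprs = ["CORR(%s, %s)" % (a, b) for a, b in combinations(cols, 2)]
sql = "SELECT\n  " + ",\n  ".join(exprs) + "\nFROM [mytable]"

print(len(exprs))  # 4950
```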
That might take a long time to write (unless you automate it). As an alternative, consider a different schema where everything lives in 3 columns. The transformation would run like this:
SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])
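Those 100 subqueries can be generated rather than typed. A small Python sketch (again assuming column names col1 … col100):

```python
# build the long-format (colname, value, rowid) transformation query
# for all 100 columns of the hypothetical [mytable]
subqueries = [
    "(SELECT 'col%d' AS colname, col%d AS value, rowid FROM [mytable])" % (i, i)
    for i in range(1, 101)
]
sql = "SELECT colname, value, rowid FROM\n" + ",\n".join(subqueries)

print(len(subqueries))  # 100
```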
With this schema you can run all the combined column correlations with a simpler query:
SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname
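To see what that query computes, here is the same self-join-and-group logic in plain Python on a toy long-format table (made-up values; BigQuery's CORR is the Pearson coefficient):

```python
import math
from collections import defaultdict
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient, the same statistic as BigQuery's CORR()."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# toy long-format table: (colname, value, rowid) -- made-up numbers
long_rows = [
    ("col1", 1.0, 1), ("col2", 2.0, 1), ("col3", 0.5, 1),
    ("col1", 2.0, 2), ("col2", 4.1, 2), ("col3", 0.2, 2),
    ("col1", 3.0, 3), ("col2", 5.9, 3), ("col3", 0.9, 3),
]

# index values by column, then rowid (the ON a.rowid = b.rowid join key)
series = defaultdict(dict)
for colname, value, rowid in long_rows:
    series[colname][rowid] = value

# one correlation per unordered column pair (the WHERE a.colname > b.colname filter)
corr = {}
for a, b in combinations(sorted(series), 2):
    shared = sorted(set(series[a]) & set(series[b]))
    corr[(a, b)] = pearson([series[a][r] for r in shared],
                           [series[b][r] for r in shared])
```

With three columns this yields three pairs, just as the SQL version would.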
(That's what I did in the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)
Note that the first query might be more complex than this last one, but I suspect it will take less time to run, as no shuffling will be required.
Since this question asks about rows, the initial transformation would be similar, but slightly different:
SELECT column, value, rowid FROM
(SELECT 'c1' AS column, c1 AS value, rowid FROM [mytable]),
(SELECT 'c2' AS column, c2 AS value, rowid FROM [mytable]),
(SELECT 'c3' AS column, c3 AS value, rowid FROM [mytable]),
...
(SELECT 'c100' AS column, c100 AS value, rowid FROM [mytable])
Then the correlation between rows would be computed as in:
SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
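To connect this back to the original goal (the top N most similar observations per row), here is what the row-level query computes, sketched in plain Python on a made-up table:

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient, the same statistic as BigQuery's CORR()."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# made-up observations: rowid -> feature vector (100 floats in the real table)
rows = {
    1: [1.0, 2.0, 3.0, 4.0],
    2: [2.0, 4.0, 6.1, 8.0],   # roughly 2x row 1
    3: [4.0, 3.0, 2.0, 1.0],   # row 1 reversed
}

# every unordered pair once (the WHERE a.rowid < b.rowid filter),
# recorded under both rows so each row sees all its candidates
neighbors = {rid: [] for rid in rows}
for a, b in combinations(rows, 2):
    r = pearson(rows[a], rows[b])
    neighbors[a].append((r, b))
    neighbors[b].append((r, a))

# keep the top-N most correlated observations per row (N = 1 here)
top_n = {rid: sorted(pairs, reverse=True)[:1] for rid, pairs in neighbors.items()}
print(top_n[1])  # row 2 is the closest match to row 1
```

In BigQuery the same top-N step would be a `ROW_NUMBER()` (or `RANK()`) over the correlation result, partitioned by `a.rowid` and ordered by the correlation descending.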