
How to use bigquery correlation based on many columns?

Given a dataset of 100k rows and 100 columns, how is it possible to use BigQuery's CORR() to find the correlation between the rows?

The schema is:

id:integer, feature1:float, feature2:float, ..., feature100:float

Edit: This is not a rolling-window time-series correlation problem. Each row is an observation of 100 features, and I'd like to use BigQuery to find the top N most similar observations for each row.

asked Aug 31 '14 04:08 by ali


People also ask

Does scaling matter for correlation?

Since the formula for calculating the correlation coefficient standardizes the variables, changes in scale or units of measurement will not affect its value. For this reason, the correlation coefficient is often more useful than a graphical depiction in determining the strength of the association between two variables.

How many columns can BigQuery have?

A table, query result, or view definition can have up to 10,000 columns. With on-demand pricing, your project can have up to 2,000 concurrent slots. BigQuery slots are shared among all queries in a single project.

How do you Unnest in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.

How do you pivot in BigQuery?

Here is a dynamic pivot procedure in standard SQL on BigQuery. It does not aggregate yet; you first need to provide a table with already per-KPI aggregated values (if needed), but it automatically creates a table and generates all the pivoted columns.


1 Answer

You want to find the correlation between each column and the other columns?

That would be something like this:

SELECT CORR(col1, col2), CORR(col1, col3), CORR(col1, col4),..., CORR(col99, col100)
FROM [mytable]
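Writing those 100·99/2 = 4,950 CORR() terms by hand is impractical, so one way to automate it is to generate the SELECT list with a short script. A minimal sketch in Python, assuming the column names follow the feature1..feature100 schema from the question:

```python
from itertools import combinations

# Column names per the question's schema: feature1 ... feature100
cols = [f"feature{i}" for i in range(1, 101)]

# One CORR() term per unordered pair of distinct columns
terms = ", ".join(f"CORR({a}, {b})" for a, b in combinations(cols, 2))
query = f"SELECT {terms} FROM [mytable]"

print(query.count("CORR("))  # 4950 pairs
```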

That might take a long time to write (unless you automate it). As an alternative, consider a different schema where everything lives in 3 columns. The transformation would run like this:

SELECT colname, value, rowid FROM
(SELECT 'col1' AS colname, col1 AS value, rowid FROM [mytable]),
(SELECT 'col2' AS colname, col2 AS value, rowid FROM [mytable]),
(SELECT 'col3' AS colname, col3 AS value, rowid FROM [mytable]),
...
(SELECT 'col100' AS colname, col100 AS value, rowid FROM [mytable])

With this schema you can run all the combined column correlations with a simpler query:

SELECT CORR(a.value, b.value) corr, a.colname, b.colname
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.rowid=b.rowid
WHERE a.colname>b.colname
GROUP BY a.colname, b.colname

(That's what I did on the article linked by @Tjorriemorrie - http://googlecloudplatform.blogspot.mx/2013/09/introducing-corr-to-google-bigquery.html)

Note that the first query might be more complex than this last one, but I suspect it will take less time to run, as no shuffling is required.

Since this question asks about rows, the initial transformation would be similar, but slightly different:

SELECT column, value, rowid FROM
  (SELECT 'c1' column, c1 AS value, rowid FROM [mytable]),
  (SELECT 'c2' column, c2 AS value, rowid FROM [mytable]),
  (SELECT 'c3' column, c3 AS value, rowid FROM [mytable]) 

Then the correlation between rows would be computed as in:

SELECT CORR(a.value, b.value), a.rowid, b.rowid
FROM [my_new_table] a
JOIN EACH [my_new_table] b
ON a.column=b.column
WHERE a.rowid < b.rowid
GROUP BY a.rowid, b.rowid
answered Oct 01 '22 15:10 by Felipe Hoffa