How to scale Pivoting in BigQuery?

Tags:

sql

google-bigquery

Let's say I have a music video play stats table, mydataset.stats, for a given day (3B rows, 1M users, 6K artists). The simplified schema is: UserGUID String, ArtistGUID String

I need to pivot/transpose artists from rows to columns, so the schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
where each Artist column holds the play count for that artist by the respective user.

An approach was suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery?, but it looks like it doesn't scale to the numbers in my example.

Can this approach be scaled for my example?

asked Jan 18 '16 00:01

Mikhail Berlyant

1 Answer

I tried the approach below for up to 6000 features and it worked as expected. I believe it will work for up to 10K features, which is the hard limit on the number of columns in a table.

STEP 1 - Aggregate plays by user / artist

SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
FROM [mydataset.stats] GROUP BY 1, 2

STEP 2 - Normalize uid and aid so that they are consecutive numbers 1, 2, 3, …
We need this for at least two reasons: a) to make the dynamically generated SQL in later steps as compact as possible, and b) to get more usable/friendly column names.

Combined with the first step, it becomes:

SELECT u.uid AS uid, a.aid AS aid, plays 
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays 
  FROM [mydataset.stats] 
  GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u.userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID

Write the output to a table: mydataset.aggs
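To make it concrete what Steps 1 and 2 produce, here is a minimal Python sketch on toy data. It is not part of the BigQuery pipeline; the function name and the sample GUIDs are made up for illustration. One difference to keep in mind: this sketch numbers ids in first-seen order, whereas ROW_NUMBER() OVER() in the query above gives no ordering guarantee.

```python
from collections import Counter


def normalize_and_aggregate(rows):
    """Map GUIDs to consecutive ints (1, 2, 3, ...) and count plays.

    rows: iterable of (user_guid, artist_guid) play events.
    Returns (aggs, user_ids, artist_ids), where aggs maps (uid, aid) -> plays.
    """
    user_ids = {}
    artist_ids = {}
    plays = Counter()
    for user_guid, artist_guid in rows:
        # setdefault only assigns a new id the first time a GUID is seen.
        uid = user_ids.setdefault(user_guid, len(user_ids) + 1)
        aid = artist_ids.setdefault(artist_guid, len(artist_ids) + 1)
        plays[(uid, aid)] += 1
    return dict(plays), user_ids, artist_ids


# Hypothetical play events: (UserGUID, ArtistGUID)
rows = [("u-a", "art-x"), ("u-a", "art-x"), ("u-a", "art-y"), ("u-b", "art-y")]
aggs, user_ids, artist_ids = normalize_and_aggregate(rows)
print(aggs)
```

Each key of `aggs` corresponds to one row of mydataset.aggs (uid, aid, plays).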

STEP 3 - Apply the approach already suggested in the questions mentioned above, for N features (artists) at a time. In my particular example, by experimenting, I found that the basic approach works well when the number of features is between 2000 and 3000. To be on the safe side, I decided to use 2000 features at a time.

The script below dynamically generates a query, which is then run to create the partitioned tables:

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

The query above produces another query, like the one below:

SELECT uid,SUM(IF(aid=1,plays,NULL)) as a1,SUM(IF(aid=3,plays,NULL)) as a3,
  SUM(IF(aid=2,plays,NULL)) as a2,SUM(IF(aid=4,plays,NULL)) as a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid

This generated query should be run and its result written to mydataset.pivot_1_2000
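If you prefer to build the query string on the client instead of with GROUP_CONCAT_UNQUOTED, the same text can be assembled in a few lines of Python. This is only a sketch: `build_pivot_query` is a hypothetical helper, not a BigQuery API, and it only produces the string, which you would still submit as a legacy SQL query yourself.

```python
def build_pivot_query(aids, source_table="mydataset.aggs"):
    """Return the legacy SQL pivot query text for the given artist ids."""
    columns = ",".join(
        "SUM(IF(aid={0},plays,NULL)) as a{0}".format(aid) for aid in aids
    )
    return "SELECT uid,{} FROM [{}] GROUP EACH BY uid".format(columns, source_table)


# Small range for readability; the first real chunk would be range(1, 2001).
query = build_pivot_query(range(1, 4))
print(query)
```

Unlike GROUP_CONCAT_UNQUOTED, this emits the columns in ascending aid order, which makes the resulting schema a bit easier to read.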

Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) gives two more tables: mydataset.pivot_2001_4000 and mydataset.pivot_4001_6000
As you can see, mydataset.pivot_1_2000 has the expected schema, but only for features with aid from 1 to 2000; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on.
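The chunk boundaries, HAVING filters, and destination table names follow a simple pattern, so they can be computed rather than edited by hand. A minimal sketch, assuming aids are consecutive starting at 1 (as Step 2 produces) and using the same naming as above; `chunk_ranges` is a made-up helper name.

```python
def chunk_ranges(num_features, chunk_size=2000):
    """List (low, high, table_name, having_clause) for each chunk of aids.

    low and high are inclusive; the HAVING clause uses the same exclusive
    bounds style as the Step 3 query.
    """
    ranges = []
    for low in range(1, num_features + 1, chunk_size):
        high = min(low + chunk_size - 1, num_features)
        table = "mydataset.pivot_{}_{}".format(low, high)
        having = "HAVING aid > {} and aid < {}".format(low - 1, high + 1)
        ranges.append((low, high, table, having))
    return ranges


for low, high, table, having in chunk_ranges(6000):
    print(table, "<-", having)
```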

STEP 4 - Merge all partitioned pivot tables into the final pivot table, with all features represented as columns in one table

As in the steps above, we first need to generate a query and then run it. Initially we “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000, then stitch the result with mydataset.pivot_4001_6000.

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)

The output string from the query above should be run and its result written to mydataset.pivot_1_4000

Then we repeat STEP 4 as below:

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

The result should be written to mydataset.pivot_1_6000
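The stitching sequence can also be generated on the client. The sketch below is an assumption-laden illustration, not something I ran against BigQuery: `build_stitch_steps` is a hypothetical helper, it assumes aids are consecutive with no gaps, and it only builds the query text plus the destination table name for each step.

```python
def build_stitch_steps(chunks):
    """Build the progressive JOIN EACH queries that merge chunk tables.

    chunks: list of (low, high) inclusive aid ranges in ascending order.
    Returns a list of (query, destination_table), one per stitch step.
    """
    steps = []
    first_low, first_high = chunks[0]
    left = "mydataset.pivot_{}_{}".format(first_low, first_high)
    for low, high in chunks[1:]:
        right = "mydataset.pivot_{}_{}".format(low, high)
        # All columns accumulated so far plus the columns of the new chunk.
        cols = ",".join("a{}".format(aid) for aid in range(first_low, high + 1))
        query = (
            "SELECT x.uid uid,{} FROM [{}] AS x "
            "JOIN EACH [{}] AS y ON y.uid = x.uid"
        ).format(cols, left, right)
        left = "mydataset.pivot_{}_{}".format(first_low, high)
        steps.append((query, left))
    return steps


# Tiny chunks for readability; the real ones would be (1, 2000), (2001, 4000), ...
steps = build_stitch_steps([(1, 2), (3, 4), (5, 6)])
for query, destination in steps:
    print(destination, "<-", query)
```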

The resulting table has the following schema:

uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int
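As a sanity check of the idea (pivot in chunks, then join on uid), here is a small self-contained sketch using Python's built-in sqlite3 module instead of BigQuery. Table names and data are made up, and CASE WHEN stands in for legacy SQL's IF(). It also shows why a plain inner join loses no users: each chunk query groups over all rows of aggs, so every uid appears in every chunk table, with NULLs where the user has no plays.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE aggs (uid INTEGER, aid INTEGER, plays INTEGER)")
conn.executemany(
    "INSERT INTO aggs VALUES (?, ?, ?)",
    [(1, 1, 5), (1, 3, 2), (2, 2, 7), (2, 4, 1), (3, 4, 9)],
)


def pivot_sql(aids):
    """Pivot query over aggs for the given aids (SQLite dialect)."""
    cols = ",".join(
        "SUM(CASE WHEN aid={0} THEN plays END) AS a{0}".format(a) for a in aids
    )
    return "SELECT uid,{} FROM aggs GROUP BY uid".format(cols)


# Pivot two chunks of features separately, as in Step 3.
conn.execute("CREATE TABLE pivot_1_2 AS " + pivot_sql([1, 2]))
conn.execute("CREATE TABLE pivot_3_4 AS " + pivot_sql([3, 4]))

# Stitch the chunks on uid, as in Step 4.
stitched = conn.execute(
    "SELECT x.uid, a1, a2, a3, a4 FROM pivot_1_2 AS x "
    "JOIN pivot_3_4 AS y ON y.uid = x.uid ORDER BY x.uid"
).fetchall()

# Pivot all features in one go for comparison.
direct = conn.execute(pivot_sql([1, 2, 3, 4]) + " ORDER BY uid").fetchall()
print(stitched == direct)
```

Note that uid 3 has plays only for aid 4, yet it still shows up in pivot_1_2 with NULL columns, so it survives the join.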

NOTE:
a. I tried this approach only up to 6000 features, and it worked as expected.
b. Run time for the second/main (generated) queries in steps 3 and 4 varied from 20 to 60 min.
c. IMPORTANT: the billing tier in steps 3 and 4 varied from 1 to 90. The good news is that the respective tables are relatively small (30-40MB), and so are the billed bytes. For “before 2016” projects everything is billed as tier 1, but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries
d. The example above shows the power of large-scale data transformation with BigQuery! Still, I think (but I may be wrong) that storing a materialized feature matrix is not the best idea.

answered Oct 11 '22 03:10

Mikhail Berlyant
