Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to scale Pivoting in BigQuery?

Let's say, I have music video play stats table mydataset.stats for a given day (3B rows, 1M users, 6K artists). Simplified schema is: UserGUID String, ArtistGUID String

I need pivot/transpose artists from rows to columns, so schema will be:
UserGUID String, Artist1 Int, Artist2 Int, … Artist8000 Int
With Artist plays count by respective user

There was an approach suggested in How to transpose rows to columns with large amount of the data in BigQuery/SQL? and How to create dummy variable columns for thousands of categories in Google BigQuery? but looks like it doesn’t scale for numbers I have in my example

Can this approach be scaled for my example?

like image 835
Mikhail Berlyant Avatar asked Jan 18 '16 00:01

Mikhail Berlyant


People also ask

How do you transpose a table in BigQuery?

For Transpose the data first of all we have to split the value data based on “;” once we split the value column then will get and array let's take the alias of that column is keyvalue. After that will unnest the array and split the data based '=' and will create 2 column (Key and value).

How do you Unnest in BigQuery?

To convert an ARRAY into a set of rows, also known as "flattening," use the UNNEST operator. UNNEST takes an ARRAY and returns a table with a single row for each element in the ARRAY . Because UNNEST destroys the order of the ARRAY elements, you may wish to restore order to the table.

What is coalesce in BigQuery?

COALESCE(expr[, ... ]) Description. Returns the value of the first non-null expression. The remaining expressions are not evaluated.


1 Answers

I tried below approach for up to 6000 features and it worked as expected. I believe it will work up to 10K features which is hard limit for number of columns in a table

STEP 1 - Aggregate plays by user / artist

SELECT userGUID as uid, artistGUID as aid, COUNT(1) as plays 
FROM [mydataset.stats] GROUP BY 1, 2

STEP 2 – Normalize uid and aid – so they are consecutive numbers 1, 2, 3, … .
We need this at least for two reasons: a) make later dynamically created sql as compact as possible and b) to have more usable/friendly columns names

Combined with first step – it will be:

SELECT u.uid AS uid, a.aid AS aid, plays 
FROM (
  SELECT userGUID, artistGUID, COUNT(1) AS plays 
  FROM [mydataset.stats] 
  GROUP BY 1, 2
) AS s
JOIN (
  SELECT userGUID, ROW_NUMBER() OVER() AS uid FROM [mydataset.stats] GROUP BY 1
) AS u ON u. userGUID = s.userGUID
JOIN (
  SELECT artistGUID, ROW_NUMBER() OVER() AS aid FROM [mydataset.stats] GROUP BY 1
) AS a ON a.artistGUID = s.artistGUID 

Let’s write output to table - mydataset.aggs

STEP 3 – Using already suggested (in above mentioned questions) approach for N features (artists) at a time. In my particular example, by experimenting, I found that basic approach works well for number of features between 2000 and 3000. To be on safe side I decided to use 2000 features at a time

Below script is used for dynamically generating query that then run to create partitioned tables

SELECT 'SELECT uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'SUM(IF(aid=' + STRING(aid) + ',plays,NULL)) as a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.aggs] GROUP EACH BY uid'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid > 0 and aid < 2001)

Above query produces yet another query like below:

SELECT uid,SUM(IF(aid=1,plays,NULL)) a1,SUM(IF(aid=3,plays,NULL)) a3,
  SUM(IF(aid=2,plays,NULL)) a2,SUM(IF(aid=4,plays,NULL)) a4 . . .
FROM [mydataset.aggs] GROUP EACH BY uid 

This should be run and written to mydataset.pivot_1_2000

Executing STEP 3 two more times (adjusting HAVING aid > NNNN and aid < NNNN) we get three more tables mydataset.pivot_2001_4000, mydataset.pivot_4001_6000
As you can see - mydataset.pivot_1_2000 has expected schema but for features with aid from 1 to 2001; mydataset.pivot_2001_4000 has only features with aid from 2001 to 4000; and so on

STEP 4 – Merging all partitioned pivot table to final pivot table with all features represented as columns in one table

Same as in above steps. First we need generate query and then run it So, initially we will “stitch” mydataset.pivot_1_2000 and mydataset.pivot_2001_4000. Then result with mydataset.pivot_4001_6000

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_2000] AS x
JOIN EACH [mydataset.pivot_2001_4000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 4001 ORDER BY aid)

Output string from above should be run and result written to mydataset.pivot_1_4000

Then we repeat STEP 4 like below

SELECT 'SELECT x.uid uid,' + 
   GROUP_CONCAT_UNQUOTED(
      'a' + STRING(aid) 
   ) 
   + ' FROM [mydataset.pivot_1_4000] AS x
JOIN EACH [mydataset.pivot_4001_6000] AS y ON y.uid = x.uid
'
FROM (SELECT aid FROM [mydataset.aggs] GROUP BY aid HAVING aid < 6001 ORDER BY aid)

Result to be written to mydataset.pivot_1_6000

The resulted table has following schema:

uid int, a1 int, a2 int, a3 int, . . . , a5999 int, a6000 int 

NOTE:
a. I tried this approach only up to 6000 features and it worked as expected
b. Run time for second/main queries in step 3 and 4 varied from 20 to 60 min
c. IMPORTANT: billing tier in steps 3 and 4 varied from 1 to 90. The good news is that respective table’s size is relatively small (30-40MB) so does billing bytes. For “before 2016” projects everything is billed as tier 1 but after October 2016 this can be an issue.
For more information, see Timing in High-Compute queries
d. Above example shows power of large-scale data transformation with BigQuery! Still I think (but I can be wrong) that storing materialized feature matrix is not the best idea

like image 156
Mikhail Berlyant Avatar answered Oct 11 '22 03:10

Mikhail Berlyant