How to create dummy variable columns for thousands of categories in Google BigQuery?

I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:

UserID   Category
------   --------
1         A
1         B
2         C
3         A
3         C
3         B

I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):

UserID    A  B  C
------    -- -- --
1         1  1  0
2         0  0  1
3         1  1  1

My problem is that I have THOUSANDS of categories (not just 3 as in this example), so this cannot be done efficiently with hand-written CASE WHEN statements.

So my questions are:

1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements?

2) Is this a situation where UDF functionality works well? It seems like it would be, but I am not familiar enough with UDFs in BigQuery to solve this problem. Would someone be able to help out?

Thanks.

wubr2000 asked Nov 30 '15 23:11





1 Answer

You can use the technique below.

First run query #1. It produces the query (query #2) that you then run to get the result you need. Please still consider Mosha's comments before going "wild" with thousands of categories :o)

Query #1:

SELECT 'select UserID, ' + 
   GROUP_CONCAT_UNQUOTED(
    'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
   ) 
   + ' from YourTable group by UserID'
FROM (
  SELECT category 
  FROM YourTable  
  GROUP BY category
)

The result will look like this (query #2):

SELECT
  UserID,
  SUM(IF(category = "A", 1, 0)) AS A,
  SUM(IF(category = "B", 1, 0)) AS B,
  SUM(IF(category = "C", 1, 0)) AS C
FROM
  YourTable
GROUP BY
  UserID

Of course, for three categories you could write it manually, but for thousands this will definitely make your day!

The result of query #2 will look as you expect:

UserID  A   B   C    
1       1   1   0    
2       0   0   1    
3       1   1   1    
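
If it is more convenient to build query #2 outside BigQuery, the same text can be assembled client-side. The following is only a minimal sketch in Python, assuming the distinct category values have already been fetched into a list. `build_dummy_query` is a hypothetical helper name, not part of any BigQuery library, and the identifier check is an added assumption: a category value is used as a column name, so values that are not plain identifiers are rejected rather than quoted.

```python
import re

# Hypothetical helper: builds the text of query #2 from a list of category values.
def build_dummy_query(categories, table="YourTable"):
    columns = []
    # De-duplicate and sort so the column order is stable.
    for category in sorted(set(categories)):
        # Assumption: each category value must also be usable as a column name.
        if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", category):
            raise ValueError("not a valid column name: %r" % category)
        columns.append(
            'SUM(IF(category = "{0}", 1, 0)) AS {0}'.format(category)
        )
    return "SELECT UserID, {} FROM {} GROUP BY UserID".format(
        ", ".join(columns), table
    )


if __name__ == "__main__":
    # Category values from the example table in the question.
    print(build_dummy_query(["A", "B", "C", "A", "C", "B"]))
```

Like query #2, this uses SUM, so a UserID that has the same category on more than one row would get a count above 1 rather than a strict 0/1 flag. If that can happen in your data, swapping SUM for MAX in the generated text is one way to keep the columns binary.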
Mikhail Berlyant answered Oct 04 '22 20:10
