I have a simple table with 2 columns: UserID and Category, and each UserID can repeat with a few categories, like so:
UserID Category
------ --------
1 A
1 B
2 C
3 A
3 C
3 B
I want to "dummify" this table: i.e. to create an output table that has a unique column for each Category consisting of dummy variables (0/1 depending on whether the UserID belongs to that particular Category):
UserID A B C
------ -- -- --
1 1 1 0
2 0 0 1
3 1 1 1
My problem is that I have THOUSANDS of categories (not just 3 as in this example) and so this cannot be efficiently accomplished using CASE WHEN statement.
So my questions are:
1) Is there a way to "dummify" the Category column in Google BigQuery without using thousands of CASE WHEN statements.
2) Is this a situation where the UDF functionality works well? It seems like it would be the case but I am not familiar enough with UDF in BigQuery to solve this problem. Would someone be able to help out?
Thanks.
Use EXCEPT in a SELECT * to select all fields from a table except the columns you don't want. In the example below, I created a CTE named orders and selected all of the columns except for item_id and item_name.
BigQuery also supports system variables. You do not need to declare system variables, but you can set any of them that are not marked read-only. You can reference system variables in queries. The following example initializes the variable x as an INT64 with the value NULL .
You specify an indicator variable (preceded by a colon) immediately after the host variable. For example: EXEC SQL SELECT COUNT(*), AVG(SALARY) INTO :PLICNT, :PLISAL:INDNULL FROM CORPDATA.
You can use below "technic"
First run query #1. It produces the query (query #2) that you need to run to get result you need. Please, still consider Mosha's comments before going "wild" with thousands categories :o)
Query #1:
SELECT 'select UserID, ' +
GROUP_CONCAT_UNQUOTED(
'sum(if(category = "' + STRING(category) + '", 1, 0)) as ' + STRING(category)
)
+ ' from YourTable group by UserID'
FROM (
SELECT category
FROM YourTable
GROUP BY category
)
Resulted will be like below - Query #2
SELECT
UserID,
SUM(IF(category = "A", 1, 0)) AS A,
SUM(IF(category = "B", 1, 0)) AS B,
SUM(IF(category = "C", 1, 0)) AS C
FROM
YourTable
GROUP BY
UserID
of course for three categories - you could do it manually, but for thousands it will definitelly will make day for you!!
Result of query #2 will looks as you expect:
UserID A B C
1 1 1 0
2 0 0 1
3 1 1 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With