Suppose I have the following query sent to BQ:
SELECT shipmentID, category, quantity
FROM [myDataset.myTable]
Further, suppose that the query returns data such as:
shipmentID category quantity
1 shoes 5
1 hats 3
2 shirts 1
2 hats 2
3 toys 3
2 books 1
3 shirts 1
How can I pivot the results, from within BQ, to produce output as follows:
shipmentID shoes hats shirts toys books
1 5 3 0 0 0
2 0 2 1 0 1
3 0 0 1 3 0
As some additional background, I actually have 2000+ categories that I need to pivot, and the quantity of data is such that I can't do it directly through a Pandas DataFrame in Python (uses all the memory, then slows to a crawl). I tried using a relational database, but ran into a column limit, so I'd like to be able to do it directly in BQ, even if I have to build the query itself through python. Any suggestions?
** Edit 1 I should mention that pivoting the data itself can be done in chunks and is therefore not the issue. The real trouble comes in trying to do the aggregation afterwards, so that I have only one row for each shipmentID. That's what eats all the RAM.
** Edit 2 After trying out the accepted answer below, I found that trying to use it to create a 2k+ column pivot table was causing "Resources exceeded" errors. My BQ team was able to refactor the query to break it into smaller chunks and allow it to go through. The basic structure of the query is as follows:
SELECT
SetA.*,
SetB.*,
SetC.*
FROM (
SELECT
shipmentID,
SUM(IF (category="Rocks", qty, 0)),
SUM(IF (category="Paper", qty, 0)),
SUM(IF (category="Scissors", qty, 0))
FROM (
SELECT
a.shipmentid shipmentid,
a.quantity quantity,
a.category category
FROM
[myDataset.myTable] a)
GROUP EACH BY
shipmentID ) SetA
INNER JOIN EACH (
SELECT
shipmentID,
SUM(IF (category="Jello Molds", quantity, 0)),
SUM(IF (category="Torque Wrenches", quantity, 0))
FROM (
SELECT
a.shipmentID shipmentID,
a.quantity quantity,
a.category category
FROM
[myDataset.myTable] a)
GROUP EACH BY
shipmentID ) SetB
ON
SetA.shipmentid = SetB.shipmentid
INNER JOIN EACH (
SELECT
shipmentID,
SUM(IF (category="Deep Thoughts", qty, 0)),
SUM(IF (category="Rainbows", qty, 0)),
SUM(IF (category="Ponies", qty, 0))
FROM (
SELECT
a.shipmentid shipmentid,
a.quantity quantity,
a.category category
FROM
[myDataset.myTable] a)
GROUP EACH BY
shipmentID ) SetC
ON
SetB.shipmentID = SetC.shipmentID
The above pattern can be continued indefinitely by adding INNER JOIN EACH
segments one after the other. For my application, BQ was able to handle about 500 columns per chunk.
This is a way to do:
select shipmentID,
sum(IF (category='shoes', quantity, 0)) AS shoes,
sum(IF (category='hats', quantity, 0)) AS hats,
sum(IF (category='shirts', quantity, 0)) AS shirts,
sum(IF (category='toys', quantity, 0)) AS toys,
sum(IF (category='books', quantity, 0)) AS books,
from
(select 1 as shipmentID, 'shoes' as category, 5 as quantity),
(select 1 as shipmentID, 'hats' as category, 3 as quantity),
(select 2 as shipmentID, 'shirts' as category, 1 as quantity),
(select 2 as shipmentID, 'hats' as category, 2 as quantity),
(select 3 as shipmentID, 'toys' as category, 3 as quantity),
(select 2 as shipmentID, 'books' as category, 1 as quantity),
(select 3 as shipmentID, 'shirts' as category, 1 as quantity),
group by shipmentID
This returns:
+-----+------------+-------+------+--------+------+-------+---+
| Row | shipmentID | shoes | hats | shirts | toys | books | |
+-----+------------+-------+------+--------+------+-------+---+
| 1 | 1 | 5 | 3 | 0 | 0 | 0 | |
| 2 | 2 | 0 | 2 | 1 | 0 | 1 | |
| 3 | 3 | 0 | 0 | 1 | 3 | 0 | |
+-----+------------+-------+------+--------+------+-------+---+
See the manual for other pivot table example.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With