Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting around BigQuery subquery & apply limitations

I have a SQL Server query that I'm trying to convert to run in BigQuery. There are three tables involved:

CalendarMonths

     FirstDayOfMonth        |    FirstDayOfNextMonth
----------------------------+----------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC

Clients

clientid |     name       | etc.
---------+----------------+------
1        |  Bob's Shop    |
2        | Anne's Cookies |

ClientLogs

 id | clientid |   timestamp    | price_current | price_old | license_count_current | license_count_old |
----+----------+----------------+---------------+-----------+-----------------------+---------------
1   |     1    | 2017-02-01 UTC |      1200     |     0     |          10           |          0        |
2   |     1    | 2018-02-03 UTC |      2400     |    1200   |          20           |         10        |
3   |     2    | 2016-07-13 UTC |      1200     |     0     |          10           |          0        |
4   |     2    | 2018-03-30 UTC |       0       |    1200   |           0           |         10        |

The T-SQL query looks something like this:

SELECT 
    FirstDayOfMonth, FirstDayOfNextMonth,
    (SELECT SUM(sizeatdatelog.price_current) 
     FROM clients c
     CROSS APPLY (SELECT TOP 1 * 
                  FROM clientlogs 
                  WHERE clientid = c.clientid 
                    AND [timestamp] < cm.FirstDayOfMonth 
                  ORDER BY [timestamp] DESC) sizeatdatelog
     WHERE sizeatdatelog.license_count_current > 0) as StartingRevenue,
    (another subquery for starting client count) as StartingClientCount,
    (another subquery for churned revenue) as ChurnedRevenue,
    (there are about 6 other subqueries)
FROM 
    CalendarMonths cm
ORDER BY 
    cm.FirstDayOfMonth

And the final output looks like:

     FirstDayOfMonth        |    FirstDayOfNextMonth      | StartingRevenue | StartingClientCount | etc
-------------------------------------------------------------------------------------------------------
2017-02-01 00:00:00.000 UTC | 2017-03-01 00:00:00.000 UTC |   68382995.43   |        79430        |
2017-03-01 00:00:00.000 UTC | 2017-04-01 00:00:00.000 UTC |   69843625.12   |        80430        |

In BigQuery, I added a simple subquery in the select clause and it worked great:

SELECT FirstDayOfMonth, FirstDayOfNextMonth, (SELECT clientId FROM clientlogs LIMIT 1 ) as cl 
FROM CalendarMonths cm
ORDER BY cm.FirstDayOfMonth

However, as soon as I add a where clause to the subquery, I get this error message:

Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

How should I proceed from this point? If I can't get the results I'm looking for in one query, maybe I should look into creating multiple scheduled jobs that create temporary tables and then a final scheduled job that joins it all together. Or maybe I could look at doing this in code via GCP or use the BigQuery API in app scripts. The data size isn't huge and the query isn't run often. I'm looking for maintainability more than efficiency, so ideally there is a way to get this data into one query.

like image 203
Jonathan McIntire Avatar asked Oct 16 '22 17:10

Jonathan McIntire


1 Answers

Below is for BigQuery Standard SQL

#standardSQL
SELECT FirstDayOfMonth, FirstDayOfNextMonth, 
  SUM(price_current) StartingRevenue, COUNT(1) StartingClientCount 
FROM (
  SELECT FirstDayOfMonth, FirstDayOfNextMonth, 
    clientid, price_current
  FROM (
    SELECT FirstDayOfMonth, FirstDayOfNextMonth, clientid,
      FIRST_VALUE(price_current) OVER(latest_values) price_current,
      FIRST_VALUE(license_count_current) OVER(latest_values) license_count_current
    FROM `project.dataset.CalendarMonths` cm
    JOIN `project.dataset.ClientLogs` cl
    ON `timestamp` < FirstDayOfMonth 
    WINDOW latest_values AS (PARTITION BY clientid ORDER BY `timestamp` DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
  )
  WHERE license_count_current > 0
  GROUP BY FirstDayOfMonth, FirstDayOfNextMonth, clientid, price_current
)
GROUP BY FirstDayOfMonth, FirstDayOfNextMonth
ORDER BY FirstDayOfMonth  

most likely above can be extended to the rest of your subqueries

like image 156
Mikhail Berlyant Avatar answered Oct 21 '22 09:10

Mikhail Berlyant