I am hitting this obstacle again and again...
JOIN EACH and GROUP EACH BY clauses can't be used on the output of window functions
Is there a best practice or recommendation for using window functions (OVER()) with very large data sets that cannot be processed on a single node?
Fragmenting my data and running the same query with different filters can work, but it's very limiting, takes a lot of time (and manual labor), and is costly (running the same query on the same data set 30 times instead of once).
Referring to Jeremy's answer below: it's better, but it still doesn't work properly. If I take my original query sample:
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A,B]')=true
)
group by title
it now works. But the same query with a wider title filter and GROUP EACH BY:
select title,count (case when contributor_id<>LeadContributor then 1 else null end) as different,
count (case when contributor_id=LeadContributor then 1 else null end) as same,
count(*) as total
from
(
SELECT title,contributor_id,lead(contributor_id)over(partition by title order by timestamp) as LeadContributor
FROM [publicdata:samples.wikipedia]
where regexp_match(title,r'^[A-Z]')=true
)
group each by title
again gives the Resources Exceeded error.
Window functions can now be executed in distributed fashion according to the PARTITION BY clause given inside OVER. If you supply a PARTITION BY with your window functions, your data will be processed in parallel similar to how JOIN EACH and GROUP EACH BY are processed.
In addition, you can use PARTITION BY on the output of JOIN EACH or GROUP EACH BY without serializing execution. Using the same keys for PARTITION BY as for JOIN EACH or GROUP EACH BY is particularly efficient, because the data will not need to be reshuffled between join/aggregation and window function execution.
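As a rough illustration of the second point, here is a minimal sketch in legacy SQL that applies a window function to the output of GROUP EACH BY. The query itself, the edits and contributor_rank aliases, and the choice of RANK() are assumptions made for the example, not something from the question. Note that the window partitions on title only, while the aggregation groups on title and contributor_id, so whether this particular shape avoids a reshuffle is not something the sketch demonstrates.

```sql
-- Hypothetical example: rank contributors within each title by edit count.
-- The inner query aggregates in parallel with GROUP EACH BY; the outer
-- window function is distributed by its PARTITION BY key (title).
SELECT
  title,
  contributor_id,
  edits,
  RANK() OVER (PARTITION BY title ORDER BY edits DESC) AS contributor_rank
FROM (
  SELECT title, contributor_id, COUNT(*) AS edits
  FROM [publicdata:samples.wikipedia]
  WHERE REGEXP_MATCH(title, r'^[A-Z]')
  GROUP EACH BY title, contributor_id
)
```

The point of the sketch is only that OVER (PARTITION BY ...) can sit directly on top of a GROUP EACH BY subquery; on a data set this large it may still hit resource limits if a single partition is very big.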