Any ideas on how to make this query return results on Google BigQuery? I'm getting a "resources exceeded" error. There are about 2B rows in the dataset. I'm trying to get the artist ID that appears most often for each user_id.
select user_id, artist, count(*) as count
from [legacy20130831.merged_data] as d
group each by user_id, artist
order by user_id ASC, count DESC
An equivalent query on public data that throws the same error:
SELECT actor, repository_name, count(*) AS count
FROM [githubarchive:github.timeline] AS d
GROUP EACH BY actor, repository_name
ORDER BY actor, count desc
Compare with the same query, plus a limit on the results to be returned. This one works (14 seconds for me):
SELECT actor, repository_name, count(*) as count
FROM [githubarchive:github.timeline] as d
GROUP EACH BY actor, repository_name
ORDER BY actor, count desc
LIMIT 100
Instead of using a LIMIT, you could process only a fraction of the user_ids at a time. In my case, one third works:
SELECT actor, repository_name, count(*) as count
FROM [githubarchive:github.timeline] as d
WHERE ABS(HASH(actor) % 3) = 0
GROUP EACH BY actor, repository_name
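If you go the "fraction at a time" route, you need one query per hash bucket. Here is a minimal Python sketch that only builds the query strings; the helper name `shard_query` and its parameters are my own invention for illustration, and it does not call BigQuery at all:

```python
def shard_query(shard, num_shards, table="[githubarchive:github.timeline]"):
    """Build the legacy SQL text for one hash bucket of actors (hypothetical helper)."""
    if not 0 <= shard < num_shards:
        raise ValueError("shard must be in the range [0, num_shards)")
    return (
        "SELECT actor, repository_name, COUNT(*) AS count\n"
        f"FROM {table} AS d\n"
        f"WHERE ABS(HASH(actor) % {num_shards}) = {shard}\n"
        "GROUP EACH BY actor, repository_name"
    )


# One query per bucket; together the buckets cover every actor exactly once.
queries = [shard_query(i, 3) for i in range(3)]
```

Each string in `queries` is the query above with a different remainder, so running all of them covers the whole table one third at a time.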
But what you really want is "to get the artist ID that appears the most for each user_id". Let's go further, and get that:
SELECT actor, repository_name, count FROM (
SELECT actor, repository_name, count, ROW_NUMBER() OVER (PARTITION BY actor ORDER BY count DESC) rank FROM (
SELECT actor, repository_name, count(*) as count
FROM [githubarchive:github.timeline] as d
WHERE ABS(HASH(actor) % 10) = 0
GROUP EACH BY actor, repository_name
))
WHERE rank=1
Note that this time I used %10, as it gets me results faster. But you might be wondering "I want to get my results with one query, not 10".
There are 2 things you can do for that:
If you are willing to share your dataset with me, I could provide dataset-specific advice (a lot depends on cardinality).
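To see why splitting by a hash of the user is safe, here is a small in-memory Python sketch of the same logic as the ROW_NUMBER() query above. It is only a simulation under assumptions: `zlib.crc32` stands in for BigQuery's HASH(), the function name `top_repo_per_actor` is hypothetical, and the sample events are made up. Because every row for a given actor lands in the same bucket, the per-bucket winners can simply be collected into one result.

```python
import zlib
from collections import Counter


def top_repo_per_actor(events, num_shards=10):
    """Return {actor: (repo, count)} by processing one hash bucket at a time."""
    result = {}
    for shard in range(num_shards):
        # Inner query: count (actor, repo) pairs for actors in this bucket only.
        counts = Counter(
            (actor, repo)
            for actor, repo in events
            if zlib.crc32(actor.encode("utf-8")) % num_shards == shard
        )
        # Outer query: keep the highest-count repo per actor (rank = 1).
        for (actor, repo), n in counts.items():
            if actor not in result or n > result[actor][1]:
                result[actor] = (repo, n)
    return result


# Made-up sample data, for illustration only.
events = [
    ("alice", "repo_a"),
    ("alice", "repo_a"),
    ("alice", "repo_b"),
    ("bob", "repo_c"),
]
top = top_repo_per_actor(events)
```

The number of buckets only changes how much work each pass does, not the answer: calling it with `num_shards=1` gives the same dictionary as the default of 10.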