
BigQuery - select top N posts from a large table for each subreddit

I am mining Reddit data on Google BigQuery and want the top 1000 posts, ranked by score, for each subreddit across the whole 2017_04 table. I have tried several approaches, but because of BigQuery's limits the result is too large to return.

select body, score, subreddit from 
  (
    select body, score, subreddit,row_number() over 
      (
        partition by subreddit order by score desc
      ) mm 
      from [fh-bigquery:reddit_comments.2017_04]
  )
  where mm <= 1000 AND subreddit in 
  (
    select subreddit from 
    (
      select Count(subreddit) as counts, subreddit from 
      [fh-bigquery:reddit_comments.2017_04] GROUP BY subreddit ORDER BY counts DESC 
      LIMIT 10000
    )
  )
LIMIT 10000000

Is there any way to divide and conquer this problem? Enabling large query results means I cannot run any complex query. Does Google offer a paid option for larger query resources?

Julian.Wu asked Jun 18 '17 00:06




1 Answer

I wanna top 1000 posts ranked by the score for each subreddit for the whole 201704 data

I just tested this query:

SELECT 
  subreddit,
  ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY 1

It processed the whole dataset in 22s.
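
The query above returns one row per subreddit, with the posts packed into an array. If one row per post is easier to work with, the array can be flattened again with UNNEST. This is only a minimal sketch in standard SQL: the names top_posts, post, pos and rank_in_subreddit are illustrative and not part of the original query.

```sql
-- Sketch: flatten the per-subreddit arrays back into one row per post.
-- top_posts, post, pos and rank_in_subreddit are illustrative names.
WITH top_posts AS (
  SELECT
    subreddit,
    ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) AS data
  FROM `fh-bigquery.reddit_comments.2017_04`
  GROUP BY subreddit
)
SELECT
  t.subreddit,
  post.body,
  post.score,
  pos + 1 AS rank_in_subreddit  -- WITH OFFSET is 0-based, so add 1
FROM top_posts AS t,
  UNNEST(t.data) AS post WITH OFFSET AS pos
```

WITH OFFSET keeps each post's position inside the array, so the ranking produced by ORDER BY score DESC survives the flattening.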

From your query, though, it seems that you only want the posts and scores for the 10000 most popular subreddits. I tried this query:

SELECT 
  subreddit,
  ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) data
FROM `fh-bigquery.reddit_comments.2017_04`
WHERE subreddit IN(
  SELECT subreddit FROM(
    SELECT
      subreddit
    FROM `fh-bigquery.reddit_comments.2017_04`               
    GROUP BY subreddit
    ORDER BY count(body) DESC
    LIMIT 10000)
  )
GROUP BY 1

And got results in 26s.

Hopefully these results are what you are looking for. Let me know if everything is correct.
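
If the result is still too large to return directly, one option is to write it to a table rather than returning it to the client. The following is a hedged sketch using a CREATE TABLE ... AS SELECT statement in standard SQL. The destination `my_project.my_dataset.top_posts_2017_04` is a hypothetical name, so substitute a project and dataset that you can write to.

```sql
-- Sketch: store the top 1000 posts per subreddit in a table
-- instead of returning them to the client.
-- `my_project.my_dataset.top_posts_2017_04` is a hypothetical destination.
CREATE OR REPLACE TABLE `my_project.my_dataset.top_posts_2017_04` AS
SELECT
  subreddit,
  ARRAY_AGG(STRUCT(body, score) ORDER BY score DESC LIMIT 1000) AS data
FROM `fh-bigquery.reddit_comments.2017_04`
GROUP BY subreddit
```

The stored table can then be queried or exported in smaller pieces, for example one subreddit at a time.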

Willian Fuks answered Oct 23 '22 03:10
