I'm trying to figure out the best way to take a random sample of 100 records for each group in a table in BigQuery.
For example, I have a table where column A is a unique recordID and column B is the groupID to which the record belongs. For every distinct groupID, I would like to take a random sample of 100 recordIDs. Is there a simple way to accomplish this?
Something like the query below should work:
SELECT recordID, groupID
FROM (
  SELECT
    recordID, groupID,
    ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) AS pos
  FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID
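The logic of the query above can be sketched in plain Python, with a hypothetical in-memory list of dicts standing in for the table (the function name and toy data are illustrative, not part of the original answer):

```python
import random
from collections import defaultdict

def sample_per_group(rows, group_key, n, seed=None):
    """Randomly pick up to n rows per group, mirroring
    ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) <= n."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    sampled = []
    for members in groups.values():
        rng.shuffle(members)         # ORDER BY RAND() within the partition
        sampled.extend(members[:n])  # keep only pos <= n
    return sampled

# Toy table: 5 records in group "a", 2 in group "b"
rows = [{"recordID": i, "groupID": "a"} for i in range(5)] + \
       [{"recordID": i, "groupID": "b"} for i in range(100, 102)]
picked = sample_per_group(rows, "groupID", 3, seed=42)
```

Groups smaller than the sample size are simply kept whole, just as `pos <= 100` would keep every row of a small group.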
Also check the RAND() documentation if you want to improve randomness.
I had a similar need, namely cluster sampling, over 400M rows and more columns, but hit an "Exceeded resources..." error when using ROW_NUMBER().
If you don't need RAND() because your data is unordered anyway, this performs quite well (<30s in my case):
SELECT ARRAY_AGG(x LIMIT 100)
FROM yourtable x
GROUP BY groupId
You can:
- UNNEST() if your front-end cannot render nested records
- ORDER BY groupId to find/confirm patterns more quickly
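The ARRAY_AGG-then-UNNEST approach can also be sketched in Python (hypothetical helper and toy data, not from the original answer): cap each group's array at n while aggregating, then flatten the arrays back into rows.

```python
from collections import defaultdict

def agg_then_unnest(rows, group_key, n):
    """Mirror ARRAY_AGG(x LIMIT n) ... GROUP BY groupId, then UNNEST():
    keep the first n rows per group (no shuffle), then flatten."""
    groups = defaultdict(list)
    for row in rows:
        g = groups[row[group_key]]
        if len(g) < n:  # ARRAY_AGG(... LIMIT n) stops collecting at n per group
            g.append(row)
    # UNNEST(): flatten the per-group arrays back into plain rows
    return [row for members in groups.values() for row in members]

# Toy table: 10 records alternating between two groups
rows = [{"recordID": i, "groupID": i % 2} for i in range(10)]
flat = agg_then_unnest(rows, "groupID", 3)
```

Note that, unlike the ROW_NUMBER()/RAND() query, this keeps the first n rows encountered per group rather than a random n, which is why it only suits data that is unordered anyway.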