
Taking a Random Sample From Each Group in BigQuery

I'm trying to figure out the best way to take a random sample of 100 records for each group in a BigQuery table.

For example, I have a table where column A is a unique recordID and column B is the groupID to which the record belongs. For every distinct groupID, I would like to take a random sample of 100 recordIDs. Is there a simple way to accomplish this?

Tas asked Feb 08 '23

2 Answers

Something like the below should work:

SELECT recordID, groupID
FROM (
  SELECT
    recordID, groupID,
    ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) AS pos
  FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID

Also check the documentation for RAND() if you want to improve randomness.
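As a side note not in the original answer: in current BigQuery Standard SQL the same per-group filter can be written more compactly with the QUALIFY clause, avoiding the subquery. `yourTable`, `recordID`, and `groupID` are the question's placeholder names:

SELECT recordID, groupID
FROM yourTable
WHERE TRUE  -- BigQuery requires a WHERE, GROUP BY, or HAVING clause alongside QUALIFY
QUALIFY ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) <= 100

This is a sketch, not a performance claim: QUALIFY still evaluates the same window function, so it will not by itself avoid the resource limits mentioned in the other answer.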

Mikhail Berlyant answered Feb 09 '23


Had a similar need, namely cluster sampling, over 400M+ rows with more columns, but hit the Exceeded resources... error when using ROW_NUMBER().

If you don't need RAND() because your data is unordered anyway, this performs quite well (<30s in my case):

SELECT ARRAY_AGG(x LIMIT 100) 
FROM yourtable x 
GROUP BY groupId

You can:

  • decorate with UNNEST() if the front-end cannot render nested records
  • add ORDER BY groupId to find/confirm patterns more quickly
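A sketch of the UNNEST() decoration mentioned above, assuming the answer's placeholder names (`yourtable`, `groupId`): the inner query aggregates up to 100 whole rows per group into an array, and the cross join with UNNEST flattens those arrays back into one flat row per sampled record.

SELECT sample.*
FROM (
  SELECT ARRAY_AGG(x LIMIT 100) AS samples
  FROM yourtable x
  GROUP BY groupId
), UNNEST(samples) AS sample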
Yannick Einsweiler answered Feb 10 '23