I'm trying to figure out the best way to take a random sample of 100 records for each group in a table in BigQuery.
For example, I have a table where column A is a unique recordID and column B is the groupID to which the record belongs. For every distinct groupID, I would like to take a random sample of 100 recordIDs. Is there a simple way to accomplish this?
Something like the query below should work:
SELECT recordID, groupID
FROM (
  SELECT
    recordID, groupID,
    ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) AS pos
  FROM yourTable
)
WHERE pos <= 100
ORDER BY groupID, recordID
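The logic of the query above can be sketched in plain Python, with a hypothetical in-memory list of dicts standing in for the table (the function name and toy data are illustrative, not part of the original answer):

```python
import random
from collections import defaultdict

def sample_per_group(rows, group_key, n, seed=None):
    """Randomly pick up to n rows per group, mirroring
    ROW_NUMBER() OVER(PARTITION BY groupID ORDER BY RAND()) <= n."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    sampled = []
    for members in groups.values():
        rng.shuffle(members)         # ORDER BY RAND() within the partition
        sampled.extend(members[:n])  # keep only pos <= n
    return sampled

# Toy table: 5 records in group "a", 2 in group "b"
rows = [{"recordID": i, "groupID": "a"} for i in range(5)] + \
       [{"recordID": i, "groupID": "b"} for i in range(100, 102)]
picked = sample_per_group(rows, "groupID", 3, seed=42)
```

Groups smaller than the sample size are simply kept whole, just as `pos <= 100` would keep every row of a small group.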
Also check the RAND() documentation if you want to improve randomness.
I had a similar need, namely cluster sampling, over 400M rows and more columns, but hit an "Exceeded resources..." error when using ROW_NUMBER().
If you don't need RAND() because your data is unordered anyway, this performs quite well (<30s in my case):
SELECT ARRAY_AGG(x LIMIT 100)
FROM yourtable x
GROUP BY groupId
You can:
- UNNEST() if your front-end cannot render nested records
- ORDER BY groupId to find/confirm patterns more quickly
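The ARRAY_AGG-then-UNNEST approach can also be sketched in Python (hypothetical helper and toy data, not from the original answer): cap each group's array at n while aggregating, then flatten the arrays back into rows.

```python
from collections import defaultdict

def agg_then_unnest(rows, group_key, n):
    """Mirror ARRAY_AGG(x LIMIT n) ... GROUP BY groupId, then UNNEST():
    keep the first n rows per group (no shuffle), then flatten."""
    groups = defaultdict(list)
    for row in rows:
        g = groups[row[group_key]]
        if len(g) < n:  # ARRAY_AGG(... LIMIT n) stops collecting at n per group
            g.append(row)
    # UNNEST(): flatten the per-group arrays back into plain rows
    return [row for members in groups.values() for row in members]

# Toy table: 10 records alternating between two groups
rows = [{"recordID": i, "groupID": i % 2} for i in range(10)]
flat = agg_then_unnest(rows, "groupID", 3)
```

Note that, unlike the ROW_NUMBER()/RAND() query, this keeps the first n rows encountered per group rather than a random n, which is why it only suits data that is unordered anyway.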