How can I do stratified sampling on BigQuery?
For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.
Use the RAND function. Before TABLESAMPLE was added, the RAND function was the usual way to retrieve a random subset of a table. The query cost is high because the whole table is scanned to generate one random number for each record.
Selecting a Randomly Distributed Sample from BigQuery Tables. There are two methods for random sampling of BigQuery tables: use the RAND function, or use the TABLESAMPLE clause.
If you want to sample individual rows, rather than data blocks, then you can use a WHERE rand() < K clause instead. However, this approach requires BigQuery to scan the entire table. To save costs but still benefit from row-level sampling, you can combine both techniques.
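As a rough sketch of combining the two techniques, the query below first reads about 20% of the table's data blocks with TABLESAMPLE and then keeps roughly half of the surviving rows with RAND(), for an overall rate near 10%. The table name `mydataset.mytable` and both rates are placeholders, not values from the question. Block sampling is not stratified, so this mainly controls cost rather than per-category proportions.

```sql
-- Assumed table name and rates; replace with your own.
SELECT *
FROM `mydataset.mytable` TABLESAMPLE SYSTEM (20 PERCENT)  -- block-level: only the sampled blocks are read
WHERE RAND() < 0.5;                                        -- row-level: keep about half of those rows
```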
I think the simplest way to get a proportionate stratified sample is to order the data by category and take every "nth" row. For a 10% sample, you want every 10th row.
This looks like:
select t.*
from (select t.*,
row_number() over (order by category, rand()) as seqnum
from t
) t
where mod(seqnum, 10) = 1;
Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
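If every category must be represented, one possible variant (a different technique from the "nth" row query) numbers the rows within each category and keeps the first 10% of each, rounded up. The names t and category follow the query above; the `* except` clause is BigQuery-specific and only drops the helper columns. Rounding up slightly oversamples small categories, so the result is only approximately proportionate. Partitioning the window may also scale better than ordering the entire table in a single window.

```sql
-- Sketch: keep ceil(10%) of each category, so every category contributes at least one row.
select * except (seqnum, cnt)
from (select t.*,
             row_number() over (partition by category order by rand()) as seqnum,
             count(*) over (partition by category) as cnt
      from t
     ) t
where seqnum <= ceil(cnt * 0.1);
```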
If you want equal sized samples, then order within each category and just take a fixed number:
select t.*
from (select t.*,
row_number() over (partition by category order by rand()) as seqnum
from t
) t
where seqnum <= 100;
Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
Both of these methods are quite handy. They can work with multiple dimensions at the same time. The first has the particularly nice feature that it also works with numeric dimensions.
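To illustrate the multi-dimension case, here is a minimal sketch of the "nth" row method ordered by two dimensions. The column region is hypothetical and stands in for any second stratification column. A numeric column could be placed in the same order by list.

```sql
-- Assumed extra column: region. Rows are ordered by both dimensions, then every 10th row is kept.
select * except (seqnum)
from (select t.*,
             row_number() over (order by category, region, rand()) as seqnum
      from t
     ) t
where mod(seqnum, 10) = 1;
```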