
Stratified random sampling with BigQuery?

How can I do stratified sampling on BigQuery?

For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.

Felipe Hoffa asked Oct 20 '18 00:10




1 Answer

I think the simplest way to get a proportionate stratified sample is to order the data by category (in random order within each category) and then take every nth row. For a 10% sample, you want every 10th row.

This looks like:

select t.*
from (select t.*,
             -- rows of a category are contiguous, in random order within the category
             row_number() over (order by category, rand()) as seqnum
      from t
     ) t
where mod(seqnum, 10) = 1;  -- keep every 10th row

Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
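
If every category must show up, one possible variant (not the query above, just a sketch) is to number the rows within each category and keep the first ceil(cnt / 10) of them. The example below runs that idea against SQLite through Python's sqlite3 module, purely so it can be executed locally. It assumes SQLite 3.25 or later for window functions, and the table `t`, the column `category` and the helper name are illustrative. In BigQuery you would write rand() where SQLite has random().

```python
import sqlite3

# Sketch only: SQLite stands in for BigQuery so the query can be run locally.
# In BigQuery, replace random() with rand().
PER_CATEGORY_QUERY = """
SELECT category, COUNT(*) AS sampled
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY category ORDER BY random()) AS seqnum,
             COUNT(*)     OVER (PARTITION BY category)                   AS cnt
      FROM t) AS s
WHERE (seqnum - 1) * 10 < cnt   -- integer form of seqnum <= ceil(cnt / 10)
GROUP BY category
"""

def per_category_sample_counts(sizes):
    """Build a toy table with the given category sizes and return
    how many rows of each category the query keeps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (category TEXT)")
    conn.executemany(
        "INSERT INTO t (category) VALUES (?)",
        [(cat,) for cat, size in sizes.items() for _ in range(size)],
    )
    counts = dict(conn.execute(PER_CATEGORY_QUERY).fetchall())
    conn.close()
    return counts

print(per_category_sample_counts({"A": 1000, "B": 25, "C": 3}))
```

With those sizes the query keeps 100 rows of A, 3 of B and 1 of C, so small categories are slightly over-represented in exchange for always being present.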

If you want equal-sized samples, then order randomly within each category and take a fixed number of rows from each:

select t.*
from (select t.*,
             row_number() over (partition by category order by rand()) as seqnum
      from t
     ) t
where seqnum <= 100;

Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
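
To see the effect of the cap on categories of different sizes, here is a small sketch that runs the same partition-and-cap query on a toy table. It uses SQLite via Python's sqlite3 module as a local stand-in (an assumption, it needs SQLite 3.25 or later). The helper name and the category sizes are made up for illustration, and random() plays the role of BigQuery's rand().

```python
import sqlite3

def capped_sample_counts(sizes, cap=100):
    """Return, per category, how many rows survive `seqnum <= cap`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (category TEXT)")
    conn.executemany(
        "INSERT INTO t (category) VALUES (?)",
        [(cat,) for cat, size in sizes.items() for _ in range(size)],
    )
    rows = conn.execute(
        """
        SELECT category, COUNT(*)
        FROM (SELECT t.*,
                     ROW_NUMBER() OVER (PARTITION BY category
                                        ORDER BY random()) AS seqnum
              FROM t) AS s
        WHERE seqnum <= ?          -- at most `cap` rows per category
        GROUP BY category
        """,
        (cap,),
    ).fetchall()
    conn.close()
    return dict(rows)

print(capped_sample_counts({"A": 500, "B": 100, "C": 7}))
```

Categories A and B are cut down to 100 rows each, while C keeps all 7 of its rows.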

Both methods are quite handy, and both can stratify on several dimensions at once. The first has a particularly nice feature: it also works with numeric dimensions.
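
As a rough illustration of the multi-dimension case, the every-10th-row query can simply order by more than one column before rand(). The sketch below assumes a second column named `region`, which is hypothetical and not part of the question, and again uses SQLite (3.25 or later) through Python's sqlite3 module as a local stand-in. In BigQuery you would use rand() and mod(seqnum, 10).

```python
import sqlite3

def two_dimension_sample_counts(sizes, n=10):
    """sizes maps (category, region) -> row count.
    Returns how many rows of each combination an every-nth-row sample keeps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (category TEXT, region TEXT)")
    conn.executemany(
        "INSERT INTO t (category, region) VALUES (?, ?)",
        [key for key, size in sizes.items() for _ in range(size)],
    )
    rows = conn.execute(
        """
        SELECT category, region, COUNT(*)
        FROM (SELECT t.*,
                     -- both dimensions come before random(), so each
                     -- (category, region) group is contiguous
                     ROW_NUMBER() OVER (ORDER BY category, region, random()) AS seqnum
              FROM t) AS s
        WHERE seqnum % ? = 1
        GROUP BY category, region
        """,
        (n,),
    ).fetchall()
    conn.close()
    return {(cat, reg): cnt for cat, reg, cnt in rows}

sizes = {("A", "east"): 300, ("A", "west"): 100,
         ("B", "east"): 50,  ("B", "west"): 20}
print(two_dimension_sample_counts(sizes))
```

Each combination ends up with one tenth of its rows (30, 10, 5 and 2 here). The counts come out exact only because every group size is a multiple of 10; otherwise a group's count can be off by one.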

Gordon Linoff answered Sep 20 '22 12:09