How can I do stratified sampling on BigQuery? For example, we want a 10% proportionate stratified sample using the category_id as the strata. We have up to 11000 category_ids in some of our tables.

Stratified random sampling with BigQuery?

1 Answer

I think the simplest way to get a proportionate stratified sample is to order the data by category and take every nth row. For a 10% sample, you want every 10th row.

This looks like:

select t.*
from (select t.*,
             -- number all rows in category order, with a random order inside each category
             row_number() over (order by category, rand()) as seqnum
      from t
     ) t
where mod(seqnum, 10) = 1;  -- keep every 10th row

Note: This does not guarantee that all categories will be in the final sample. A category with fewer than 10 rows may not appear.
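The behaviour of the first query can be checked outside BigQuery with a small Python sketch. It is an illustration of the same idea (sort by category with a random tiebreak, keep every 10th row), not BigQuery code; the function name `nth_row_sample` and the toy data are assumptions made for the example.

```python
import random
from collections import Counter


def nth_row_sample(rows, key, n=10, seed=None):
    """Sort rows by category with a random tiebreak, then keep the rows
    whose 1-based sequence number satisfies seqnum % n == 1."""
    rng = random.Random(seed)
    # Pair each row with a random number so rows inside a category are shuffled.
    decorated = [(key(row), rng.random(), row) for row in rows]
    decorated.sort(key=lambda item: (item[0], item[1]))
    return [row for seqnum, (_, _, row) in enumerate(decorated, start=1)
            if seqnum % n == 1]


# Toy data (hypothetical): (category, row id) pairs with very different sizes.
rows = ([("a", i) for i in range(1003)]
        + [("b", i) for i in range(200)]
        + [("c", i) for i in range(5)])

sample = nth_row_sample(rows, key=lambda row: row[0], seed=0)
counts = Counter(category for category, _ in sample)
```

With these sizes the sample holds 101 rows of `a`, 20 rows of `b`, and none of `c`: each large category contributes about 10% of its rows, while the 5-row category happens to fall between two selected sequence numbers and drops out.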

If you want equal-sized samples, then order within each category and take a fixed number:

select t.*
from (select t.*,
             row_number() over (partition by category order by rand()) as seqnum
      from t
     ) t
where seqnum <= 100;

Note: This does not guarantee that 100 rows exist within each category. It takes all rows for smaller categories and a random sample of larger ones.
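The same logic can be sketched in Python to see how small and large categories are treated. This is only an illustration of "shuffle within each category, keep the first k"; the function name `top_k_per_category` and the toy data are assumptions, not part of the original answer.

```python
import random
from collections import Counter, defaultdict


def top_k_per_category(rows, key, k=100, seed=None):
    """Keep at most k randomly chosen rows from each category."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for category in sorted(groups):
        members = groups[category]
        rng.shuffle(members)        # random order inside the partition
        sample.extend(members[:k])  # equivalent of seqnum <= k
    return sample


# Toy data (hypothetical): (category, row id) pairs.
rows = ([("a", i) for i in range(1000)]
        + [("b", i) for i in range(200)]
        + [("c", i) for i in range(5)])

sample = top_k_per_category(rows, key=lambda row: row[0], k=100, seed=0)
counts = Counter(category for category, _ in sample)
```

Here `a` and `b` are each cut down to 100 rows, while `c` keeps all 5 of its rows because it has fewer than 100.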

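Both methods are quite handy, and both work with multiple dimensions at the same time: add the extra columns to the order by (first method) or the partition by (second method). The first has a particularly nice feature: it also works with numeric dimensions, because ordering by a numeric column and taking every nth row spreads the sample across that column's range.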
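One reading of the remark about numeric and multiple dimensions is sketched below in Python. The helper `every_nth_by`, the `price` column, and the two categories are hypothetical and exist only for the illustration.

```python
import random


def every_nth_by(rows, sort_key, n=10):
    """Sort rows by sort_key and keep rows whose 1-based position p has p % n == 1."""
    ordered = sorted(rows, key=sort_key)
    return [row for seqnum, row in enumerate(ordered, start=1) if seqnum % n == 1]


# Hypothetical rows: (category, price), with prices 0..999 in shuffled order.
rng = random.Random(0)
prices = list(range(1000))
rng.shuffle(prices)
rows = [("a" if price % 2 else "b", price) for price in prices]

# Numeric dimension: ordering by price spreads the sample over the price range.
by_price = every_nth_by(rows, sort_key=lambda row: row[1])
sampled_prices = [price for _, price in by_price]

# Two dimensions at once: order by category first, then by price.
by_both = every_nth_by(rows, sort_key=lambda row: (row[0], row[1]))
```

Ordering by price alone picks prices 0, 10, 20, and so on up to 990, so the sample covers the whole range evenly. Ordering by category and then price gives 50 rows from each category, each set spread across that category's prices.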

102

answered Sep 20 '22 12:09

Gordon Linoff

Tags:

sql

google-bigquery

Felipe Hoffa
