Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to perform stratification based on a column in Snowflake

I am using Snowflake to write my sql queries. We have a huge table with billions of records containing customer information. The goal is to get random sample out and use R to look at the distributions. Unfortunately, we can't use JDBC/ODBC connection from RStudio to the Database. It is a limitation. So I am left with pulling extract from Snowflake and importing into R.

The difficulty, is we have a column called CUSTOMER SEGMENT which has almost 24 unique values. The goal is to get a sample which represents significant proportion from each segment. I tried the following query;

SELECT DISTINCT *
FROM test sample(10)

to obtain random sample where each row has 10 percent probability of being selected. But I am not obtaining sample from each values of customer segment. May I know of any sql commands, which can help to stratify based on Customer segment. thanks in advance.

like image 495
jay Avatar asked Oct 15 '25 15:10

jay


2 Answers

An alternative way of sampling for more equal-sized partitions is to use round robin sampling

select t.*
from (select t.*, 
             row_number() over (partition by segment order by random()) as seqnum,
             count(*) over () as cnt
      from test t
     ) t
where seqnum <= 20;

The "20" says up to 20 rows for each segment.

This can be modified for a percentage based sample. It is not clear if that is necessary.

like image 87
Gordon Linoff Avatar answered Oct 17 '25 05:10

Gordon Linoff


For some large number of rows, this will approach a stratified sample.

select     *
from       test
order by   row_number() over (partition by segment_1, segment_2 order by random()) /
           count(*) over (partition by segment_1, segment_2)
limit      1000000
like image 20
jasonchang2018 Avatar answered Oct 17 '25 06:10

jasonchang2018