Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do a stratified random sample in Redshift?

I need to take a random sample of customers who have purchased from different categories. There are 8 categories, and orders are spread among them.

If I wanted to take a random sample of customers who have made a purchase, but keep the proportion of orders per category the same, how would I set that up in my sql code?

A table highlighting this is below (it doesn't include customer data - I want my list of customers to be based of the representative proportion of orders):

table that I am using can be found here: https://imgur.com/a/Q0lMHWf

like image 547
nz426 Avatar asked Dec 17 '22 19:12

nz426


2 Answers

Basically, you would use row_number(), order by, and choose the nth values. So, for about a 1% stratified sample, do:

select t.*
from (select t.*,
             row_number() over (order by category, rand()) as seqnum
      from t
     ) t
where mod(seqnum, 101) = 1
order by category;

The basic idea is that you can get a stratified sample by ordering the result set by the categories and doing an nth sample of the result.

like image 82
Gordon Linoff Avatar answered Jan 05 '23 01:01

Gordon Linoff


@Dnoeth solution is effective and easy to use. But, please change rand() into random(). AWS redshift's random function name is 'random'. Please refer https://docs.aws.amazon.com/redshift/latest/dg/r_RANDOM.html. 'random() works for me but not rand()

select *

from
(
 select t.*
  ,percent_rank()
   over (partition by category
         order by random()) as pr
from tab
) as dt 
where pr <= 0.1 -- similar to 10% sample
like image 33
Veeramani Natarajan Avatar answered Jan 05 '23 01:01

Veeramani Natarajan