I need to take a random sample of customers who have purchased from different categories. There are 8 categories, and orders are spread among them.
If I wanted to take a random sample of customers who have made a purchase, but keep the proportion of orders per category the same, how would I set that up in my sql code?
A table highlighting this is below (it doesn't include customer data - I want my list of customers to be based of the representative proportion of orders):
table that I am using can be found here: https://imgur.com/a/Q0lMHWf
Basically, you would use row_number()
, order by
, and choose the nth values. So, for about a 1% stratified sample, do:
select t.*
from (select t.*,
row_number() over (order by category, rand()) as seqnum
from t
) t
where mod(seqnum, 101) = 1
order by category;
The basic idea is that you can get a stratified sample by ordering the result set by the categories and doing an nth sample of the result.
@Dnoeth solution is effective and easy to use. But, please change rand() into random(). AWS redshift's random function name is 'random'. Please refer https://docs.aws.amazon.com/redshift/latest/dg/r_RANDOM.html. 'random() works for me but not rand()
select *
from
(
select t.*
,percent_rank()
over (partition by category
order by random()) as pr
from tab
) as dt
where pr <= 0.1 -- similar to 10% sample
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With