
Stratified sampling in Pig?

Does anyone have an idea of how to do stratified sampling (see Wikipedia) in Pig?

For the moment, I do something like:

relation2 = SAMPLE relation1 0.05;

but my dataset contains a label column with only a few distinct values, some of which are rare (0.5%, for example), and I would like my random downsampling not to drop all of them.

Thanks a lot.

Scratch asked Mar 12 '26 12:03



1 Answer

You could implement your own sampling by generating a value with RANDOM() for each row and filtering out rows whose value falls below, say, 0.95, which keeps about 5% of them. To stratify the sample, compute what fraction of your rows carry each label value, then scale the threshold accordingly so that different values get sampled at different rates.
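
As a rough sketch of that idea, and not a tested recipe: the script below assumes relation1 has a field named label (a hypothetical name), uses a base rate of 0.05, and picks an arbitrary floor of about 100 rows per label. It counts the rows per label, derives a sampling rate for each label, joins the rates back onto the data, and filters on RANDOM().

```pig
-- Assumption: relation1 has a field called label; adjust to your schema.
-- Count how many rows carry each label value.
grouped = GROUP relation1 BY label;
counts  = FOREACH grouped GENERATE group AS label, COUNT(relation1) AS n;

-- Per-label sampling rate (the 0.05 rate and the floor of 100 rows are
-- illustrative choices):
--   common labels: the base rate of 0.05
--   rare labels:   a higher rate, so that about 100 rows are expected
--   tiny labels:   1.0, i.e. keep every row
rates = FOREACH counts GENERATE label,
        ((double)n * 0.05 >= 100.0 ? 0.05
            : (n <= 100L ? 1.0 : 100.0 / (double)n)) AS rate;

-- Attach the rate to every row. rates has one row per label, so it is
-- assumed small enough for a replicated (map-side) join.
joined = JOIN relation1 BY label, rates BY label USING 'replicated';

-- RANDOM() returns a double in [0, 1), so a rate of 1.0 keeps every row.
relation2 = FILTER joined BY RANDOM() < rates::rate;
```

Note that relation2 still carries the two extra columns from rates; a FOREACH ... GENERATE over the original fields can drop them. Rows with a null label are lost in the join, so they would need separate handling if they matter.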

reo katoa answered Mar 15 '26 14:03



