The following returns a 10% sample of the A and X columns, stratified by the values of X.
select A, X from (
  select A, X,
    count(*) over (partition by X) as cnt,
    rank() over (partition by X order by rand()) as rnk
  from my_table) t
where rnk <= cnt*0.1
In other words, if X takes the values [X0, X1], it returns the union of:
a 10% sample of the rows where X = X0
a 10% sample of the rows where X = X1
How can I stratify my query by the values of tuples over several columns (e.g. X, Y)?
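For example, if X takes values [X0, X1] and Y takes values [Y0, Y1], I would like to get a sample that is the union of:
a 10% sample of the rows where X = X0 and Y = Y0
a 10% sample of the rows where X = X0 and Y = Y1
a 10% sample of the rows where X = X1 and Y = Y0
a 10% sample of the rows where X = X1 and Y = Y1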
What is stratified sampling? In stratified sampling, researchers divide subjects into subgroups called strata based on characteristics that they share (e.g., race, gender, educational attainment). Once divided, each subgroup is randomly sampled using another probability sampling method.
A stratified sample is one that results from classifying a population into mutually exclusive groups, called strata, and drawing a simple random sample from each stratum. The main reason for using stratified sampling instead of simple random sampling is improved sampling efficiency [2,3].
Stratified random sampling is a method of sampling that involves the division of a population into smaller subgroups known as strata. In stratified random sampling, or stratification, the strata are formed based on members' shared attributes or characteristics, such as income or educational attainment.
In big data scenarios, when the data volume is huge, we may need to work with a subset of the data to speed up analysis. Sampling is the technique of selecting and analyzing such a subset in order to identify patterns and trends in the data.
I'd use your method above, but use a hash of the columns you'd like to consider.
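As a rough sketch of that idea (not tested against a real cluster), the query from the question can partition on a hash of the columns instead of on X alone. This assumes Hive's built-in hash() function, that my_table also has a Y column, and that window partitions may be defined on an expression; the subquery alias t is just a placeholder.

```sql
-- Sketch only: 10% sample stratified by the (X, Y) pair.
-- Assumes my_table has columns A, X, Y and that Hive's hash() is available.
select A, X, Y
from (
  select A, X, Y,
         -- number of rows in the stratum identified by hash(X, Y)
         count(*) over (partition by hash(X, Y)) as cnt,
         -- random ordering of the rows within that stratum
         rank() over (partition by hash(X, Y) order by rand()) as rnk
  from my_table
) t
-- keep roughly the first 10% of each stratum
where rnk <= cnt * 0.1;
```

Two distinct (X, Y) pairs could in principle collide on the same hash value and end up in one stratum. If that matters, writing partition by X, Y in both window clauses should give the same stratification without relying on a hash. Either way, a stratum with fewer than 10 rows contributes no rows, because rnk starts at 1.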