Stratified Sampling in Hive

Tags: sql, hive, qubole

The following returns a 10% sample of the A and X columns stratified by the values of X.

  select A, X
  from (
    select A, X,
           count(*) over (partition by X) as cnt,
           rank() over (partition by X order by rand()) as rnk
    from my_table
  ) t  -- alias renamed from "table", which is a reserved word in Hive
  where rnk <= cnt * 0.1

In other words, if X takes the values [X0, X1] it returns the union of:

  • 10% of the rows where X = X0
  • 10% of the rows where X = X1

How can I stratify my query by the combined values of several columns (e.g. the tuple (X, Y))?

For example, if X takes values [X0, X1] and Y takes values [Y0, Y1], I would like to get a sample that is the union of:

  • 10% of the rows where X = X0 and Y = Y0
  • 10% of the rows where X = X0 and Y = Y1
  • 10% of the rows where X = X1 and Y = Y0
  • 10% of the rows where X = X1 and Y = Y1
Amelio Vazquez-Reina asked Aug 12 '14 21:08



1 Answer

I'd use your method above, but partition by a hash of the columns you'd like to stratify on (e.g. X and Y) instead of by X alone.
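
A minimal sketch of that suggestion, assuming Hive's built-in hash() function and the table and column names from the question (my_table, A, X, Y); the subquery alias t is arbitrary, and this is not necessarily the exact query the answer had in mind:

```sql
-- Sketch: roughly 10% of the rows from each (X, Y) stratum.
select A, X, Y
from (
  select A, X, Y,
         -- hash(X, Y) maps each (X, Y) pair to an int used as the stratum key.
         count(*) over (partition by hash(X, Y)) as cnt,
         rank()   over (partition by hash(X, Y) order by rand()) as rnk
  from my_table
) t
where rnk <= cnt * 0.1;
```

Two different (X, Y) pairs could in principle hash to the same value and be merged into one stratum. Writing `partition by X, Y` in both window clauses is an alternative that avoids such collisions.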

Jim Murphy answered Oct 30 '22 12:10