Stratified Sampling in Hive

Tags: sql, hive, qubole

The following returns a 10% sample of the A and X columns stratified by the values of X.

  select A, X from (
    select A, X,
        count(*) over (partition by X) as cnt,
        rank() over (partition by X order by rand()) as rnk
    from my_table) t
  where rnk <= cnt * 0.1

In other words, if X takes the values [X0, X1] it returns the union of:

10% of the rows where X = X0
10% of the rows where X = X1

How can I stratify my query by tuples of values from several columns (e.g. X, Y)?

For example, if X takes values [X0, X1] and Y takes values [Y0, Y1], I would like to get a sample that is the union of:

10% of the rows where X = X0 and Y = Y0
10% of the rows where X = X0 and Y = Y1
10% of the rows where X = X1 and Y = Y0
10% of the rows where X = X1 and Y = Y1

asked Aug 12 '14 21:08

Amelio Vazquez-Reina

1 Answer

I'd use your method above, but partition by a hash of the columns you'd like to stratify on instead of by X alone.
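As a rough sketch of that idea, and not a tested query, the version below swaps the partition key for Hive's built-in hash() applied to both columns. It assumes my_table also has a column named Y, as in the question's example, and it uses t as the subquery alias. Everything else follows the question's query.

```sql
-- Sketch: roughly 10% of the rows within each (X, Y) combination.
-- Assumes my_table has columns A, X and Y (Y is taken from the question's example).
select A, X, Y
from (
  select A, X, Y,
         -- number of rows sharing the same hash of (X, Y)
         count(*) over (partition by hash(X, Y)) as cnt,
         -- random ordering of the rows inside that group
         rank() over (partition by hash(X, Y) order by rand()) as rnk
  from my_table
) t
where rnk <= cnt * 0.1;
```

One caveat on the design: two different (X, Y) pairs could in principle hash to the same value and be sampled as one stratum. Writing `partition by X, Y` in both window clauses avoids that and should give the same stratification.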

answered Oct 30 '22 12:10

Jim Murphy
