
What is the best query to sample from Impala for a huge database?

I have a huge table (more than 1 billion rows) in Impala. I need to sample about 100,000 rows several times. What is the best way to query for sample rows?

Soroosh asked Jul 20 '15 16:07


3 Answers

As Jeff mentioned, what you've asked for exactly isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns the samples, comma-delimited as a single row. There is no way to change the number of samples yet. If there are fewer than 200,000 rows, all will be returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.
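To illustrate the general idea of reservoir sampling mentioned above, here is a minimal Python sketch of the textbook version (often called Algorithm R). It is not Impala's implementation; the function name `reservoir_sample` and its parameters are made up for this example.

```python
import random


def reservoir_sample(rows, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    Hypothetical helper for illustration only; not Impala's code.
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # The first k items fill the reservoir directly.
            reservoir.append(row)
        else:
            # Item i (0-based) is kept with probability k / (i + 1),
            # replacing a uniformly chosen slot.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# With fewer rows than the reservoir size, every row comes back.
print(reservoir_sample(range(8), 200_000))
```

The input is read in a single pass and only `k` items are held in memory at any time, which is what makes the technique attractive for very large tables.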

There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.

For example, sampling trivially from a table with 8 rows:

> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id)             |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s
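Since the samples come back as one comma-delimited string, one possible workaround is to split the value on the client side after fetching it. A minimal sketch, assuming the sampled column holds integers and the cell looks like the output above (`parse_samples` is a hypothetical helper name):

```python
def parse_samples(cell):
    """Split a comma-delimited sample string into a list of ints.

    Assumes integer values; other column types would need another converter.
    """
    return [int(part) for part in cell.split(",") if part.strip()]


samples = parse_samples("0, 1, 2, 3, 4, 5, 6, 7")
print(samples)
```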

(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)

Matt answered Oct 16 '22 07:10


Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow its development.

Jeff Hammerbacher answered Oct 16 '22 05:10


In retrospect, knowing that TABLESAMPLE is unavailable, one could add a field "RVAL" (a random 32-bit integer, for instance) to each record and sample repeatedly by adding "where RVAL > x and RVAL < y" for appropriate values of x and y. Assuming RVAL is uniformly distributed, the width of the interval relative to the full range of RVAL sets the expected fraction of rows returned. Samples drawn from non-overlapping intervals [x1,y1], [x2,y2], ... will be independent. You can also select using "where RVAL%10000 = 1", "= 2", and so on, for a separate family of independent subsets.
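The RVAL idea can be tried out on a small scale. The sketch below uses Python's built-in `sqlite3` as a stand-in for Impala, purely to show the shape of the two predicates; the table name `events`, the helper names, and the 100,000-row size are assumptions for the example, and the `WHERE` clauses are plain SQL that would need to be adapted to the real table.

```python
import random
import sqlite3

# In-memory stand-in table; in practice RVAL would be added when loading data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, rval INTEGER)")
rng = random.Random(42)
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    ((i, rng.randrange(2**32)) for i in range(100_000)),
)


def sample_range(lo, hi):
    """Rows whose rval falls in the half-open interval [lo, hi)."""
    cur = conn.execute(
        "SELECT id FROM events WHERE rval >= ? AND rval < ?", (lo, hi)
    )
    return [row[0] for row in cur]


def sample_bucket(bucket, n_buckets):
    """Rows whose rval modulo n_buckets equals bucket."""
    cur = conn.execute(
        "SELECT id FROM events WHERE rval % ? = ?", (n_buckets, bucket)
    )
    return [row[0] for row in cur]


# Two adjacent, non-overlapping intervals, each about 1% of the rval range.
width = 2**32 // 100
first = sample_range(0, width)
second = sample_range(width, 2 * width)
print(len(first), len(second), set(first).isdisjoint(second))

# Modulo buckets give another family of disjoint subsets.
print(len(sample_bucket(1, 100)), len(sample_bucket(2, 100)))
```

Half-open intervals are used here so that adjacent ranges neither overlap nor skip the boundary value. The sample size is only approximate: it varies around the expected fraction, so an exact row count would need a `LIMIT` or a slightly wider interval.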

Tony Bartoletti answered Oct 16 '22 07:10