What is the best query to sample from Impala for a huge database?

Question

I have a huge table (more than 1 billion rows) in Impala. I need to sample ~100,000 rows several times. What is the best way to query sample rows?

Matt · Accepted Answer

As Jeff mentioned, exactly what you've asked for isn't possible yet, but we do have an internal aggregate function which takes 200,000 samples (using reservoir sampling) and returns them comma-delimited in a single row. There is no way to change the number of samples yet. If the table has fewer than 200,000 rows, all of them are returned. If you're interested in how this works, see the implementation of the aggregate function and reservoir sampling structures.

There isn't a way to 'split' or explode the results yet, either, so I don't know how helpful this will be.

For example, sampling trivially from a table with 8 rows:

> select sample(id) from functional.alltypestiny
+------------------------+
| sample(id)             |
+------------------------+
| 0, 1, 2, 3, 4, 5, 6, 7 |
+------------------------+
Fetched 1 row(s) in 4.05s

(For context: this was added in a past release to support histogram statistics in the planner, which unfortunately isn't ready yet.)
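For readers unfamiliar with reservoir sampling, here is a minimal Python sketch of the classic single-pass variant (often called Algorithm R). It is only an illustration of the general technique and is not taken from Impala's implementation; the function name `reservoir_sample` and its parameters are made up for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Return up to k items chosen uniformly at random in one pass over rows.

    Hypothetical helper for illustration; not Impala's actual code.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# With fewer rows than k, every row comes back, as in the 8-row example above.
print(reservoir_sample(range(8), 200_000))
```

The memory used is bounded by k regardless of the table size, which is presumably why this kind of approach suits an aggregate function that sees each row once.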

Jeff Hammerbacher · Answer

Impala does not currently support TABLESAMPLE, unfortunately. See https://issues.cloudera.org/browse/IMPALA-1924 to follow its development.

Tony Bartoletti · Answer

In retrospect, knowing that TABLESAMPLE is unavailable, one could add a field "RVAL" (a random 32-bit integer, for instance) to each record and sample repeatedly by adding "where RVAL > x and RVAL < y" for appropriate values of x and y. Samples drawn from non-overlapping intervals [x1,y1], [x2,y2], ... will be independent. You can also select using "where RVAL%10000 = 1", "= 2", etc., for a separate population of independent subsets.
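To make "appropriate values of x and y" concrete, here is a rough Python sketch that computes non-overlapping RVAL ranges sized for a target sample. It assumes RVAL is an unsigned integer drawn uniformly from [0, 2^32), and it uses half-open ranges (>= lo, < hi) rather than the strict inequalities above so that adjacent ranges neither overlap nor leave gaps. The helper name `rval_predicates` is hypothetical, and each sample's size will only be approximately the target.

```python
from typing import List


def rval_predicates(total_rows: int, sample_size: int, num_samples: int,
                    bits: int = 32, column: str = "RVAL") -> List[str]:
    """Build WHERE-clause predicates for disjoint, equally sized RVAL ranges.

    Assumes the column holds uniform random integers in [0, 2**bits).
    """
    space = 1 << bits
    # The fraction of the value space to cover equals the fraction of rows wanted.
    width = max(1, round(space * sample_size / total_rows))
    if width * num_samples > space:
        raise ValueError("not enough room for that many disjoint samples")
    return [
        f"{column} >= {i * width} AND {column} < {(i + 1) * width}"
        for i in range(num_samples)
    ]


# About 100,000 rows out of 1 billion, three independent samples.
for predicate in rval_predicates(1_000_000_000, 100_000, 3):
    print("WHERE", predicate)
```

Each predicate can be appended to an ordinary SELECT against the table. Note that without partitioning or similar on RVAL, every such query would likely still scan the full table.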
