random sample of size N in Athena

I'm trying to obtain a random sample of N rows from Athena. But since the table from which I want to draw this sample is huge, the naive

SELECT
id
FROM mytable
ORDER BY RANDOM()
LIMIT 100

takes forever to run, presumably because the ORDER BY requires all data to be sent to a single node, which then shuffles and orders the data.

I know about TABLESAMPLE but that allows one to sample some percentage of rows rather than some number of them. Is there a better way of doing this?

asked Jun 13 '17 00:06

RoyalTS

1 Answer

Athena is actually built on Presto, so you can use TABLESAMPLE to get a random sample of your table.

Let's say you want a 10% sample of your table. Your query would be something like:

SELECT id FROM mytable TABLESAMPLE BERNOULLI(10)

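Note that there are two sampling methods, BERNOULLI and SYSTEM. BERNOULLI includes each row independently with the given probability, while SYSTEM samples whole chunks of data (such as splits or files), which is faster but less uniform. See the Presto documentation on TABLESAMPLE for details.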
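
To get a fixed number N of rows rather than a percentage, one option is to combine the two approaches: sample a percentage expected to yield somewhat more than N rows, then shuffle and LIMIT only that small sample. Here is a minimal Python sketch that builds such a query, assuming you know the approximate row count of the table. The names sample_query, approx_rows and oversample are hypothetical helpers for illustration, not part of Athena or Presto.

```python
def sample_query(table: str, column: str, n: int,
                 approx_rows: int, oversample: float = 2.0) -> str:
    """Build a query that returns about n random rows (hypothetical helper).

    approx_rows is an assumed, rough estimate of the table's row count.
    oversample > 1 makes it likely the Bernoulli sample holds at least n rows.
    """
    if n <= 0 or approx_rows <= 0 or oversample <= 0:
        raise ValueError("n, approx_rows and oversample must be positive")

    # Percentage expected to yield roughly oversample * n rows, capped at 100.
    pct = min(100.0, 100.0 * oversample * n / approx_rows)

    # Fixed-point formatting avoids scientific notation for tiny percentages.
    return (
        f"SELECT {column} FROM {table} TABLESAMPLE BERNOULLI({pct:.8f}) "
        f"ORDER BY random() LIMIT {n}"
    )


if __name__ == "__main__":
    # Example: 100 rows from a table of roughly one million rows.
    print(sample_query("mytable", "id", 100, 1_000_000))
```

The ORDER BY random() now only has to shuffle the small sample instead of the whole table. Because Bernoulli sampling is itself random, the sample can occasionally hold fewer than N rows, in which case the query returns fewer. A larger oversample makes that less likely.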

answered Sep 20 '22 15:09

Itay Kahana
