Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

random sample of size N in Athena

Tags:

I'm trying to obtain a random sample of N rows from Athena. But since the table from which I want to draw this sample is huge the naive

SELECT
id
FROM mytable
ORDER BY RANDOM()
LIMIT 100

takes forever to run, presumably because the ORDER BY requires all data to be sent to a single node, which then shuffles and orders the data.

I know about TABLESAMPLE but that allows one to sample some percentage of rows rather than some number of them. Is there a better way of doing this?

like image 865
RoyalTS Avatar asked Jun 13 '17 00:06

RoyalTS


People also ask

Does Athena use Presto or Trino?

Functions in Athena engine version 2 are based on Presto 0.217 .

Is Athena based on Presto?

What is AWS Athena? AWS Athena is an interactive query service based on Presto that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage.


1 Answers

Athena is actually behind Presto. You can use TABLESAMPLE to get a random sample of your table.

Lets say you want 10% sample of your table, your query will be something like:

SELECT id FROM mytable TABLESAMPLE BERNOULLI(10)

Pay attention that there is BERNOULLI and SYSTEM sampling. Here is the documentation for it.

like image 113
Itay Kahana Avatar answered Sep 20 '22 15:09

Itay Kahana