Can I use Athena / Presto to sort a table before writing?

I want to archive my logs into the Parquet format. Before writing the table, I want to sort it by a column c so that each Parquet file will only have a small range of c. That will allow Athena / Presto to efficiently scan the table when a query includes a WHERE clause on column c (via predicate pushdown).
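To illustrate why the sort matters, here is a toy model in Python (not Athena or Presto code). It assumes each file records only the minimum and maximum of c, and that a reader skips any file whose range cannot match the predicate. Real Parquet statistics are kept per row group or page and engines differ in how they use them, so the function names, file sizes, and numbers below are illustrative assumptions only.

```python
import random


def write_files(values, rows_per_file):
    """Split values into consecutive chunks and keep (min, max) per chunk,
    a stand-in for the column statistics stored alongside each file."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files


def files_to_scan(files, lo, hi):
    """Count files whose [min, max] range overlaps the predicate lo <= c <= hi."""
    return sum(1 for fmin, fmax in files if fmax >= lo and fmin <= hi)


random.seed(0)
c = list(range(10_000))
random.shuffle(c)  # logs arriving in arbitrary order of c

unsorted_files = write_files(c, 1_000)
sorted_files = write_files(sorted(c), 1_000)

# Hypothetical query: WHERE c BETWEEN 2500 AND 2600
unsorted_scanned = files_to_scan(unsorted_files, 2_500, 2_600)
sorted_scanned = files_to_scan(sorted_files, 2_500, 2_600)

print("files scanned without sorting:", unsorted_scanned)
print("files scanned after sorting:  ", sorted_scanned)
```

With the data sorted, only the one file covering 2000 to 2999 overlaps the predicate. Without sorting, nearly every file spans almost the whole range of c, so few or none can be skipped.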

However, it's unclear to me whether I can use Athena or Presto to sort the entire table. I need a distributed sort, not one that takes place on a single node, because the dataset is too big to fit on a single node. Is such a sort possible? If so, how do I invoke it?

conradlee asked Nov 22 '25 12:11


1 Answer

Presto has supported distributed sort since version 0.206. Athena is currently based on Presto 0.172, and I don't know whether this feature has been backported to it.

So your choices are:

  • grab the latest Presto from https://trino.io/download.html
  • get an easy-to-deploy Presto on AWS from Starburst (https://www.starburstdata.com/presto-aws-cloud/) (disclaimer: I am from Starburst)
  • use the Presto bundled with EMR (I don't know how it comes configured, but distributed sort is probably still enabled by default)
Piotr Findeisen answered Nov 25 '25 11:11


