Presto has many connectors. While the connectors do implement both read and write operations, all the tutorials I have read suggest they are typically used as read-only data sources. For example, Netflix has "10 petabytes" of data on Amazon S3, and they explicitly state that no disk (and no HDFS) is used on the Presto worker nodes. The stated use case is "ad hoc interactive" queries.
Also, Amazon Athena is essentially S3 + Presto and comes with similar use cases.
I'm puzzled how this can work in practice. Obviously, you don't want to scan 10 PB of data on every query. So I assume you want to keep some previously fetched data in memory, such as a database index. However, with no constraints on the data or the queries, I fail to see how this can be efficient.
Use case 1: I run the same query frequently, e.g. to show metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?
Use case 2: I'm analysing a large data set. Each query is slightly different, however there are common subqueries or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?
Or, if this is not the case, would I be well advised to store intermediate results somewhere myself (e.g. with CREATE TABLE AS ...)?
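To make the second question concrete, here is the kind of manual materialization I have in mind, as a sketch in Presto SQL (the catalog, schema, table, and column names are all made up for illustration):

```sql
-- Materialize a commonly filtered subset once...
CREATE TABLE hive.analytics.events_recent AS
SELECT user_id, event_type, event_time
FROM hive.raw.events
WHERE event_time >= DATE '2021-01-01';

-- ...then run the slightly different analysis queries
-- against the much smaller table instead of the full data set.
SELECT event_type, count(*) AS n
FROM hive.analytics.events_recent
GROUP BY event_type;
```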
Presto itself has no data caching tier. To be honest, I don't think the features you are proposing are supposed to be provided by Presto as a SQL analytics engine. For both of the use cases you mention, I suggest deploying Alluxio alongside Presto as a caching layer:
Use case 1: I run the same query frequently, e.g. to show metric on a dashboard. Does Presto avoid rescanning the data points which are already 'known'?
As a caching layer, Alluxio can detect the data access pattern from Presto (or other applications) and make caching/eviction decisions to serve the most frequently used data from a memory tier (or, depending on your configuration, from SSD or HDD). This helps when the same data is read repeatedly.
Use case 2: I'm analysing a large data set. Each query is slightly different, however there are common subqueries or we filter to a common subset of the data. Does Presto learn from previous queries and carry over intermediate results?
With more knowledge of your input data, you can enforce data policies in Alluxio to (1) preload the data behind common subqueries into the caching space, (2) set a TTL to retire data from the Alluxio cache and make room for other hot data, and (3) set caching policies on specific input paths (e.g., CACHE on certain paths, NO CACHE on others).
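As a rough illustration of (1) and (2), the Alluxio filesystem CLI exposes commands along these lines (the paths are hypothetical, and the exact command names and flags vary between Alluxio versions, so check the docs for your release):

```
# Preload a path that common subqueries hit, so Presto reads it from cache
alluxio fs load /warehouse/events_recent

# Set a TTL (in milliseconds) so the cached copy is retired after one day
alluxio fs setTtl /warehouse/events_recent 86400000
```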
Check out more tips for running the Presto/Alluxio stack: https://www.alluxio.io/blog/top-5-performance-tuning-tips-for-running-presto-on-alluxio-1/