I have read that Presto is meant for ad-hoc querying, while Hive/Spark are more apt for ETL scenarios. It seems the reason not to use Presto for ETL is that Presto queries can fail and there is no mid-query fault tolerance.
However, it looks like we can get around this by running Presto in our daily Jenkins workflow with retries on query failure. Has anybody tried this approach, or are there any downsides to it?
If you are using Presto for ETL, how big were your Presto clusters? What kind of EC2 instances did you use for your Presto cluster?
Presto is designed for faster query processing than Hive, so it sacrifices fault tolerance to some extent. Typically, more than 99.5% of Presto queries finish without any error on the first run.
Presto provides an additional compute layer for faster analytics. It doesn't store the data itself, which gives it the massive advantage of being able to scale resources for queries up and down based on demand. This separation of compute and storage makes the Presto query engine extremely well suited for cloud environments.
Presto buffers intermediate data in memory while tasks run, but it is not meant to serve as a caching solution or a persistent storage layer. It is primarily designed as a query execution engine that lets you query disparate data sources in place.
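To make "query disparate data sources" concrete: a single Presto statement can join tables that live in different catalogs. Here is a minimal sketch using the presto-python-client package (prestodb); the coordinator host, catalogs, schemas, and table names are all hypothetical placeholders:

    import prestodb

    # Hypothetical coordinator and session settings; adjust for your cluster.
    conn = prestodb.dbapi.connect(
        host="presto.example.com",
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()

    # One query joins a Hive table with a MySQL table (hypothetical names).
    # Presto reads both sources in place; nothing is copied into Presto itself.
    cur.execute("""
        SELECT o.order_id, c.name
        FROM hive.sales.orders AS o
        JOIN mysql.crm.customers AS c ON o.customer_id = c.id
        LIMIT 10
    """)
    print(cur.fetchall())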
Provided your ETL jobs are not very long or complex (i.e., standard SQL is sufficient for the transformations needed), I think Presto can do a reasonable job. As you pointed out, there is no mid-query fault tolerance, so you need a mechanism to restart your queries upon failure; hopefully, the speed of Presto will offset an occasional restart. One additional strategy is to break long, complex queries into a series of shorter, simpler ones, creating temporary tables in between to effectively achieve manual checkpointing (see the sketch after this paragraph). Facebook leveraged such a strategy when migrating some of their batch Hive jobs to Presto: https://www.slideshare.net/kbajda/presto-at-hadoop-summit-2016
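To make the restart-and-checkpoint approach concrete, here is a minimal sketch in Python, again assuming the presto-python-client package (prestodb). The stage queries, table names, and connection settings are hypothetical; the point is that each stage materializes its result as a table, so a failure only re-runs the current stage:

    import time
    import prestodb

    # Hypothetical coordinator and session settings; adjust for your cluster.
    conn = prestodb.dbapi.connect(
        host="presto-etl.example.com",
        port=8080,
        user="etl",
        catalog="hive",
        schema="staging",
    )

    # Each stage writes its result to a table, acting as a manual checkpoint.
    # IF NOT EXISTS lets a re-run of the whole job skip stages that already
    # completed (assuming a failed CTAS does not commit a partial table).
    STAGES = [
        """CREATE TABLE IF NOT EXISTS etl_enriched AS
           SELECT e.user_id, e.event_time, u.country
           FROM raw_events AS e
           JOIN users AS u ON e.user_id = u.id""",
        """CREATE TABLE IF NOT EXISTS daily_country_counts AS
           SELECT country, date(event_time) AS day, count(*) AS events
           FROM etl_enriched
           GROUP BY country, date(event_time)""",
    ]

    def run_with_retry(sql, attempts=3, backoff_s=60):
        # Presto has no mid-query fault tolerance, so retry the whole statement.
        for attempt in range(1, attempts + 1):
            try:
                cur = conn.cursor()
                cur.execute(sql)
                cur.fetchall()  # drain the result so the query runs to completion
                return
            except Exception:
                if attempt == attempts:
                    raise  # let the scheduler (e.g., Jenkins) mark the job failed
                time.sleep(backoff_s * attempt)

    for stage_sql in STAGES:
        run_with_retry(stage_sql)

A script like this can run as a step in the daily Jenkins job you describe, with Jenkins providing the job-level retry and alerting on top.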
One additional recommendation I would make is to spin up a separate Presto cluster for ETL to avoid any resource contention with your interactive Presto workload.
In terms of instance types, it obviously depends on your queries. Most often you want a good balance of RAM and CPU, so starting with the R4 instance type is a good bet. Once you observe your workload at runtime, you can either add more nodes to speed up the ETL process or explore other instance types (e.g., if the CPU is fully loaded, moving to the C4/C5 instance types might be worthwhile).
More generally, the Presto-Users mailing list is a good source of information: https://groups.google.com/group/presto-users. You can also learn from community members at events such as Presto Summit (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/).