
Hardware requirements for Presto

Tags: presto, trino

I suspect the answer is "it depends", but is there any general guidance about what kind of hardware to plan to use for Presto?

Since Presto uses a coordinator and a set of workers, and the workers run alongside the data, I imagine the main issues will be having sufficient RAM for the coordinator, sufficient network bandwidth for the partial results sent from workers to the coordinator, and so on.

If you can supply some general thoughts on how to size for this appropriately, I'd love to hear them.

Asked Nov 08 '13 by benvolioT


1 Answer

Most people are running Trino (formerly PrestoSQL) on the Hadoop nodes they already have. At Facebook we typically run Presto on a few nodes within the Hadoop cluster to spread out the network load.

Generally, I'd go with the industry-standard ratios for a new cluster: 2 cores and 2-4 GB of memory for each disk, with 10 gigabit networking if you can afford it. Once you have a few machines (4+), benchmark using your queries on your data. It should be obvious if you need to adjust the ratios.

In terms of sizing the hardware for a cluster from scratch, here are some things to consider:

  • Total data size will determine the number of disks you will need. HDFS has a large overhead so you will need lots of disks.
  • The ratio of CPU speed to disks depends on the ratio between hot data (the data you are working with) and cold data (archive data). If you are just starting your data warehouse, you will need lots of CPUs since all the data will be new and hot. On the other hand, most physical disks can only deliver data so fast, so at some point more CPUs don't help. (A back-of-the-envelope sizing sketch follows this list.)
  • The ratio of CPU speed to memory depends on the size of aggregations and joins you want to perform and the amount of (hot) data you want to cache. Currently, Presto requires the final aggregation results and the hash table for a join to fit in memory on a single machine (we're actively working on removing these restrictions). If you have larger amounts of memory, the OS will cache disk pages which will significantly improve the performance of queries.
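
To make the arithmetic concrete, here is a rough sizing sketch. Every number in it (replication factor, headroom, disk size, disks per node) is an assumption you would swap for your own; it simply applies the disk-count reasoning and the core/memory-per-disk ratios mentioned above, and is not a Presto tool.

```python
# Back-of-the-envelope cluster sizing sketch. Every constant below is an
# assumption to replace with your own numbers.
import math

def size_cluster(raw_data_tb,
                 replication=3,        # HDFS default replication factor
                 headroom=0.30,        # spare capacity for growth and temp space
                 disk_tb=4.0,          # usable capacity of one data disk
                 disks_per_node=12,
                 cores_per_disk=2,     # the "industry standard" ratio above
                 mem_gb_per_disk=4):
    # Total data size (with HDFS replication overhead) determines the disk count.
    storage_needed_tb = raw_data_tb * replication * (1 + headroom)
    disks = math.ceil(storage_needed_tb / disk_tb)
    nodes = math.ceil(disks / disks_per_node)
    return {
        "nodes": nodes,
        "cores_per_node": disks_per_node * cores_per_disk,
        "memory_gb_per_node": disks_per_node * mem_gb_per_disk,
    }

print(size_cluster(raw_data_tb=100))
# -> {'nodes': 9, 'cores_per_node': 24, 'memory_gb_per_node': 48}
```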

In 2013 at Facebook we ran our Presto processes as follows:

  • We ran our JVMs with a 16 GB heap to leave most memory available for OS buffers.
  • On the machines we ran Presto we didn't run MapReduce tasks.
  • Most of the Presto machines had 16 real cores and used processor affinity (eventually cgroups) to limit Presto to 12 cores, so the Hadoop data node process and other things could run easily. (A small affinity example follows this list.)
  • Most of the servers were on a 10 gigabit network, but we did have one large, old, crufty cluster using 1 gigabit (which worked fine).
  • We used the same configuration for the coordinator and the workers.
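
The core limiting itself happens at the OS level rather than inside Presto (the answer mentions processor affinity and, later, cgroups). As a minimal, hypothetical illustration of the same idea only, Linux exposes the scheduler affinity call directly, for example from Python:

```python
# Minimal illustration of CPU affinity on Linux (Python 3.3+). This is NOT the
# mechanism described above (which was affinity/cgroups around the Presto
# launcher); it only shows the idea: keep a process on cores 0-11 and leave
# the remaining cores for the HDFS data node and other services.
import os

os.sched_setaffinity(0, set(range(12)))   # pid 0 means "the current process"
print(sorted(os.sched_getaffinity(0)))    # -> [0, 1, 2, ..., 11]
```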

More recently, we ran the following:

  • The machines had 256 GB of memory and we ran a 200 GB Java heap (see the configuration sketch after this list).
  • Most of the machines had 24-32 real cores and Presto was allocated all cores.
  • The machines had only minimal local storage for logs, with all table data remote (in a proprietary distributed file system).
  • Most servers had a 25 gigabit network connection to a fabric network.
  • The coordinators and workers had similar configurations.
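
Mapping a machine spec like that onto Presto's deployment files is mostly mechanical; the sketch below renders heap and query-memory settings into jvm.config / config.properties text. The file and property names are the standard ones from the Presto/Trino deployment docs, but every number and ratio here is an assumption, not the configuration Facebook actually ran.

```python
# Sketch: render a machine spec into Presto-style config file contents.
# Property names come from the standard Presto/Trino deployment docs; the
# numbers and ratios are assumptions chosen for illustration only.

def render_configs(heap_gb=200,              # e.g. a 200 GB heap on a 256 GB box
                   query_mem_fraction=0.5,   # assumed per-query share of the heap
                   workers=10):              # assumed number of worker nodes
    per_node_gb = int(heap_gb * query_mem_fraction)
    jvm_config = "\n".join([
        "-server",
        f"-Xmx{heap_gb}G",
        "-XX:+UseG1GC",
        "-XX:+ExitOnOutOfMemoryError",
    ])
    config_properties = "\n".join([
        f"query.max-memory-per-node={per_node_gb}GB",
        f"query.max-memory={per_node_gb * workers}GB",   # cluster-wide cap
    ])
    return {"jvm.config": jvm_config, "config.properties": config_properties}

for name, body in render_configs().items():
    print(f"=== {name} ===\n{body}\n")
```
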
Answered Oct 21 '22 by Dain Sundstrom