I have access to a computer running Linux with 20 cores, 92 GB of RAM, and 100 GB of storage on an HDD. I would like to use Hadoop for a task involving a large amount of data (over 1M words, over 1B word combinations). Would pseudo-distributed mode or fully distributed mode be the better way to leverage the power of Hadoop on a single computer?
For my intended use of Hadoop, data loss and having to re-run a job because of node failure are not major concerns.
There are two ways to install Hadoop: single-node and multi-node. A single-node cluster means only one DataNode runs, with the NameNode, DataNode, ResourceManager, and NodeManager all set up on the same machine. This is used for study and testing purposes.
As a point of reference for larger clusters: with 100 DataNodes in a cluster, 64 GB of RAM on the NameNode provides plenty of room to grow the cluster.
The ideal setup for running Hadoop is machines with a dual-core configuration (physical cores, preferably) and 4 GB to 8 GB of ECC memory per server/node. Focusing on good memory specifications is important because smooth HDFS operation relies heavily on memory efficiency and robustness.
As per my understanding, you have a single machine with 20 cores. In this case there is no need to virtualize it, because any VMs you create will consume part of the total resources. The best option is to install a Linux OS on the machine, install Hadoop in pseudo-distributed mode, and configure the available resources for container allocation.
You need CPU cores as well as memory to get good performance, so 20 cores alone will not help you; you also need a good amount of physical memory. You can refer to this document for guidance on allocating memory.
The fundamental idea behind Hadoop is distributed computing and storage for processing large data in a cost-effective manner. So if you try to create multiple small machines inside the same parent machine by using virtualization, it won't help you, because a lot of resources will be consumed by the operating system of each virtual machine. Instead, if you install Hadoop directly on the machine and configure its resources properly, jobs will execute in multiple containers (depending on availability and requirements), and parallel processing will still happen. This way you can get the maximum performance out of the existing machine.
So the best option is to set up a pseudo-distributed cluster and allocate resources properly. Pseudo-distributed mode is a mode in which all the Hadoop daemons run on a single machine.
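As a minimal sketch of what that looks like in practice (assuming a standard Hadoop 2.x/3.x install with YARN; the localhost address and port follow the single-node setup documentation, and dfs.replication = 1 is reasonable since you said data loss is not a large issue), the pseudo-distributed part of the configuration amounts to:

(core-site.xml) fs.defaultFS = hdfs://localhost:9000
(hdfs-site.xml) dfs.replication = 1
(mapred-site.xml) mapreduce.framework.name = yarn
(yarn-site.xml) yarn.nodemanager.aux-services = mapreduce_shuffle

With that in place, HDFS and YARN run as local daemons and every job is scheduled as containers on the same machine.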
With the hardware configuration that you shared, you can use the configuration below for your Hadoop setup. This should be able to handle a reasonable load.
(yarn-site.xml) yarn.nodemanager.resource.memory-mb = 81920
(yarn-site.xml) yarn.scheduler.minimum-allocation-mb = 1024
(yarn-site.xml) yarn.scheduler.maximum-allocation-mb = 81920
(yarn-site.xml) yarn.nodemanager.resource.cpu-vcores = 16
(yarn-site.xml) yarn.scheduler.minimum-allocation-vcores = 1
(yarn-site.xml) yarn.scheduler.increment-allocation-vcores = 1
(yarn-site.xml) yarn.scheduler.maximum-allocation-vcores = 16
(mapred-site.xml) mapreduce.map.memory.mb = 4096
(mapred-site.xml) mapreduce.reduce.memory.mb = 8192
(mapred-site.xml) mapreduce.map.java.opts = -Xmx3072m
(mapred-site.xml) mapreduce.reduce.java.opts = -Xmx6144m
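In case it helps, here is a sketch of how those values map onto the actual XML files (these are the standard YARN/MapReduce property names from the list above; the files typically live under etc/hadoop/ in the Hadoop install directory):

yarn-site.xml:

<configuration>
  <!-- Memory and vcores the single NodeManager offers to containers:
       80 GB of the 92 GB RAM and 16 of the 20 cores, leaving headroom
       for the OS and the Hadoop daemons themselves. -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>81920</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <!-- Container requests are granted between these memory bounds. -->
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>81920</value>
  </property>
</configuration>

mapred-site.xml:

<configuration>
  <!-- Container sizes for map and reduce tasks; the JVM heap (-Xmx) is set to
       about 75% of the container so non-heap memory still fits. -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3072m</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
  </property>
</configuration>

With these numbers, memory allows roughly 20 map-sized containers (81920 / 4096) and the vcore setting allows 16, so you can expect on the order of 16 to 20 map tasks running in parallel on this machine.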