 

How to Best Run Hadoop on a Single Machine?

I have access to a computer running Linux with 20 cores, 92 GB of RAM, and 100 GB of storage on an HDD. I would like to use Hadoop for a task involving a large amount of data (over 1M words, over 1B word combinations). Would pseudo-distributed mode or fully distributed mode be the best way to leverage the power of Hadoop on a single computer?

For my intended use of Hadoop, experiencing data loss and having to re-run the job because of node failure are not large issues.

This project involving Linux Containers uses fully distributed mode. This article describes pseudo-distributed mode; more detail can be found here.

asked Jul 30 '15 by AST

1 Answer

As I understand it, you have a single machine with 20 cores. In that case there is no need to virtualize it, because any VMs you create will consume part of the total resources. The best option is to install a Linux OS on the machine, install Hadoop in pseudo-distributed mode, and configure the available resources for container allocation.

You need both CPU cores and memory to get good performance, so 20 cores alone will not help you; you also need a good amount of physical memory. You can refer to this document for guidance on allocating memory.

The fundamental idea behind Hadoop is distributed computing and storage for processing large data in a cost-effective manner. If you try to carve multiple small machines out of the same parent machine through virtualization, it won't help, because a lot of resources will be consumed by the OS of each individual VM. Instead, if you install Hadoop directly on the machine and configure its resources properly, jobs will execute in multiple containers (depending on availability and requirements) and parallel processing will still happen. That way you can get the maximum performance out of the existing machine.

So the best option is to set up a pseudo-distributed cluster and allocate resources properly. Pseudo-distributed mode is a mode in which all the Hadoop daemons run on a single machine.
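
If you have not done a pseudo-distributed setup before, the essential pieces are a handful of entries in the Hadoop configuration files. The sketch below follows the standard Apache single-node guide; the hdfs://localhost:9000 address is the commonly used example value, and any local directory paths are left to you, so treat the exact values as assumptions you may adjust.

(core-site.xml)
<configuration>
  <property>
    <!-- All clients and daemons talk to a single local NameNode -->
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>

(hdfs-site.xml)
<configuration>
  <property>
    <!-- Only one DataNode exists, so keep a single replica of each block -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

(mapred-site.xml)
<configuration>
  <property>
    <!-- Run MapReduce jobs on YARN instead of the local job runner -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>

(yarn-site.xml)
<configuration>
  <property>
    <!-- Auxiliary service needed for the MapReduce shuffle phase -->
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

After that you would format the NameNode once (hdfs namenode -format), start the daemons with start-dfs.sh and start-yarn.sh, and confirm with jps that the NameNode, DataNode, ResourceManager, and NodeManager are all running.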

With the hardware configuration you shared, you can use the configuration below for your Hadoop setup. It should handle a reasonable load.

(yarn-site.xml)    yarn.nodemanager.resource.memory-mb        = 81920
(yarn-site.xml)    yarn.scheduler.minimum-allocation-mb       = 1024
(yarn-site.xml)    yarn.scheduler.maximum-allocation-mb       = 81920
(yarn-site.xml)    yarn.nodemanager.resource.cpu-vcores       = 16
(yarn-site.xml)    yarn.scheduler.minimum-allocation-vcores   = 1
(yarn-site.xml)    yarn.scheduler.increment-allocation-vcores = 1
(yarn-site.xml)    yarn.scheduler.maximum-allocation-vcores   = 16
(mapred-site.xml)  mapreduce.map.memory.mb      = 4096
(mapred-site.xml)  mapreduce.reduce.memory.mb   = 8192
(mapred-site.xml)  mapreduce.map.java.opts      = -Xmx3072m
(mapred-site.xml)  mapreduce.reduce.java.opts   = -Xmx6144m
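
In the configuration files these entries are ordinary <property> blocks; the sketch below simply renders the values above as XML and assumes nothing beyond the listed properties. The figures appear to reserve roughly 12 GB of RAM and 4 of the 20 cores for the OS and the Hadoop daemons themselves, which is a sensible margin.

(yarn-site.xml)
<configuration>
  <!-- Memory and vcores the NodeManager may hand out to containers -->
  <property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>81920</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-mb</name>
    <value>1024</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>81920</value>
  </property>
  <property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>16</value>
  </property>
  <property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.increment-allocation-vcores</name>
    <value>1</value>
  </property>
  <property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>16</value>
  </property>
</configuration>

(mapred-site.xml)
<configuration>
  <!-- Container sizes for map and reduce tasks; JVM heaps stay below the container limits -->
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>4096</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>8192</value>
  </property>
  <property>
    <name>mapreduce.map.java.opts</name>
    <value>-Xmx3072m</value>
  </property>
  <property>
    <name>mapreduce.reduce.java.opts</name>
    <value>-Xmx6144m</value>
  </property>
</configuration>

With these numbers, memory alone would allow about 81920 / 4096 = 20 concurrent map containers, but the 16 vcores cap you at roughly 16 parallel tasks, which matches the hardware well. The -Xmx heap settings are kept below the container sizes so tasks are not killed for exceeding their memory limits.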
answered Sep 23 '22 by Amal G Jose