How HDFS works when running Hadoop on a single node cluster?

Question

There is a lot of content explaining data locality and how MapReduce and HDFS works on multi-node clusters. But I can't find much information regarding a single node setup. In the past three months that I'm experimenting with Hadoop I'm always reading tutorials and threads regarding number of mappers and reducers and writing custom partitioners to optimize jobs, but I always think, does it apply to a single node cluster?

What is the loss of running MapReduce jobs on a single node cluster comparing to a multi-node cluster?

Does the parallelism that is provided by splitting the input data still applies in this case?

What's the difference of reading input from a single node HDFS and reading from the local filesystem?

I think due to my little experience I can't answer these questions clearly, so any help is appreciated!

Thanks in advance!

EDIT: I understand Hadoop is not suitable for a single node setup because of all the factors listed by @TC1. So, what's the benefit of setting up a pseudo-distributed Hadoop environment?

9 revs · Accepted Answer

I'm always reading tutorials and threads regarding number of mappers and reducers and writing custom partitioners to optimize jobs, but I always think, does it apply to a single node cluster?

It depends. Combiners are run between mapping and reducing and you'd definitely feel the impact even on a single node if they were used right. Custom partitioners -- probably no, the data hits the same disk before reducing. They would affect the logic, i.e., what data your reducers receive, but probably not the performance

What is the loss of running MapReduce jobs on a single node cluster comparing to a multi-node cluster?

Processing capability. If you can get by with a single node setup for your data, you probably shouldn't be using Hadoop for your processing in the first place.

Does the parallelism that is provided by splitting the input data still applies in this case?

No, the bottleneck typically is I/O, i.e., accessing the disk. In this case, you're still accessing the same disk, only hitting it from more threads.

What's the difference of reading input from a single node HDFS and reading from the local filesystem?

Virtually non-existent. The idea of HDFS is to
- store files in big, contiguous blocks, to avoid disk seeking
- replicate these blocks among the nodes to provide resilience;
both of those are moot when running on a single node.

EDIT:

The difference between "single-node" and "pseudo-distributed" is that in single-mode all the Hadoop processes run on a single JVM. There's no network communication involved, not even through localhost etc. Even if simply testing a job on small data, I'd advise to use pseudo-distributed since that is essentially the same as a cluster.

How HDFS works when running Hadoop on a single node cluster?

Tags:

hadoop

mapreduce

hdfs

João Melo

1 Answers

9 revs

Recent Activity

Donate For Us

How HDFS works when running Hadoop on a single node cluster?

Tags:

hadoop

mapreduce

hdfs

João Melo

1 Answers

9 revs

Related questions

Recent Activity

Donate For Us