
How does HDFS work when running Hadoop on a single-node cluster?

There is a lot of content explaining data locality and how MapReduce and HDFS work on multi-node clusters, but I can't find much information about a single-node setup. In the three months I have been experimenting with Hadoop, I keep reading tutorials and threads about the number of mappers and reducers and about writing custom partitioners to optimize jobs, and I always wonder: does any of it apply to a single-node cluster?

What do you lose by running MapReduce jobs on a single-node cluster compared to a multi-node cluster?

Does the parallelism provided by splitting the input data still apply in this case?

What's the difference between reading input from a single-node HDFS and reading from the local filesystem?

I think due to my little experience I can't answer these questions clearly, so any help is appreciated!

Thanks in advance!

EDIT: I understand Hadoop is not suitable for a single node setup because of all the factors listed by @TC1. So, what's the benefit of setting up a pseudo-distributed Hadoop environment?

João Melo asked Oct 21 '22 20:10


1 Answer

I'm always reading tutorials and threads regarding number of mappers and reducers and writing custom partitioners to optimize jobs, but I always think, does it apply to a single node cluster?

  • It depends. Combiners run between the map and reduce phases, and used correctly they have a noticeable impact even on a single node. Custom partitioners probably do not: all the data hits the same disk before reducing. They change the logic, i.e., which data each reducer receives, but probably not the performance.
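
To illustrate why a combiner still matters on one node, here is a minimal sketch in plain Python (not the Hadoop API) of a word count. The function names and the two input splits are hypothetical, chosen only for illustration. It compares how many intermediate records reach the reduce step with and without a combiner.

```python
from collections import Counter

def map_words(split):
    """Map step: emit one (word, 1) pair per word in the split."""
    return [(word, 1) for word in split.split()]

def combine(pairs):
    """Combiner: pre-aggregate the pairs produced by a single map task."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

def reduce_counts(pairs):
    """Reduce step: sum the counts per word across all map outputs."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Hypothetical input: two splits of a tiny text file.
splits = ["a b a a b", "b b a c c"]

map_outputs = [map_words(s) for s in splits]
without_combiner = [pair for out in map_outputs for pair in out]
with_combiner = [pair for out in map_outputs for pair in combine(out)]

# Fewer intermediate records, same final result.
print(len(without_combiner), len(with_combiner))
print(reduce_counts(without_combiner) == reduce_counts(with_combiner))
```

In this toy input the combiner cuts the intermediate records from 10 to 5 while the final counts stay identical. In a real job those records are spilled to disk and sorted, so the saving applies whether or not a network is involved.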

What do you lose by running MapReduce jobs on a single-node cluster compared to a multi-node cluster?

  • Processing capability. If you can get by with a single node setup for your data, you probably shouldn't be using Hadoop for your processing in the first place.

Does the parallelism provided by splitting the input data still apply in this case?

  • No. The bottleneck is typically I/O, i.e., accessing the disk. On a single node you are still accessing the same disk, only hitting it from more threads.
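
As a rough sketch of that point, the plain-Python example below (again not Hadoop code; `make_splits`, `count_words`, and the sample lines are hypothetical) cuts an input into splits and hands each split to its own worker thread. The splitting and the concurrent tasks work the same way on one machine; what a single node lacks is a separate disk behind each task.

```python
from concurrent.futures import ThreadPoolExecutor

def make_splits(lines, split_size):
    """Cut the input into fixed-size splits, one per map task."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def count_words(split):
    """Stand-in for a map task: count the words in one split."""
    return sum(len(line.split()) for line in split)

# Hypothetical input held in memory; a real job would read it from disk,
# which is where the tasks on a single node start competing.
lines = ["one two three", "four five", "six", "seven eight nine ten"]
splits = make_splits(lines, split_size=2)

# Each split goes to its own worker, as map tasks would on a cluster.
with ThreadPoolExecutor(max_workers=2) as pool:
    per_split = list(pool.map(count_words, splits))

total = sum(per_split)
print(per_split, total)
```

If the work per split were CPU-heavy rather than I/O-heavy, the extra workers could still help on a multi-core machine, so treat the "no" above as a statement about the typical I/O-bound job.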

What's the difference between reading input from a single-node HDFS and reading from the local filesystem?

  • Virtually non-existent. The idea of HDFS is to

    • store files in big, contiguous blocks, to avoid disk seeking;
    • replicate these blocks among the nodes to provide resilience.

    Both of those are moot when running on a single node.
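
The replication point can be made concrete with a little arithmetic. The sketch below is a hypothetical helper, not part of any Hadoop API. It assumes a 128 MB block size and a replication factor of 3 (common defaults, but check your own configuration), and that HDFS stores at most one replica of a block per DataNode.

```python
import math

def block_layout(file_size_mb, block_size_mb=128, replication=3, datanodes=1):
    """Return (number of blocks, replicas actually stored per block)."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    # Assumption: at most one replica of a given block per DataNode,
    # so the effective replication is capped by the number of DataNodes.
    replicas = min(replication, datanodes)
    return blocks, replicas

single = block_layout(1000, datanodes=1)
cluster = block_layout(1000, datanodes=5)
print(single, cluster)
```

For a 1000 MB file both layouts have 8 blocks, but the single node can hold only one copy of each, so losing its disk loses the data regardless of the configured replication factor.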

EDIT:

The difference between "single-node" (standalone) and "pseudo-distributed" is that in single-node mode all the Hadoop processes run in a single JVM. There's no network communication involved, not even through localhost. Even when simply testing a job on small data, I'd advise using pseudo-distributed mode, since it is essentially the same as a cluster.

9 revs answered Oct 24 '22 11:10