
Should the HBase region server and Hadoop data node be on the same machine?

Tags: hadoop, hbase

Sorry, I don't have the resources to set up a cluster to test this, so I'm just wondering:

  1. Can I deploy the HBase region server on a separate machine from the Hadoop data node? I guess the answer is yes, but I'm not sure.

  2. Is it good or bad to deploy the HBase region server and the Hadoop data node on different machines?

  3. When putting some data into HBase, where is this data eventually stored: the data node or the region server? I guess it's the data node, but then what are the StoreFile and HFile in the region server? Aren't they the physical files that store our data?

Thank you!

asked Jan 06 '15 by gfytd


2 Answers

  1. RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.

  2. Very bad: it would work against the data locality principle (if you want to know a little more about data locality, check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html).

  3. Actual data will be stored in HDFS (on the DataNodes); RegionServers are responsible for serving and managing regions (see the sketch below).
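
To illustrate point 3, here is a minimal sketch of a write using the standard HBase Java client API; the table name `demo_table`, column family `cf`, and ZooKeeper host are placeholder assumptions. The client sends the Put to the RegionServer hosting the row's region; the RegionServer buffers it in its MemStore and later flushes it to an HFile, a file that physically lives on HDFS, i.e. on the DataNodes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Placeholder quorum address; point this at your cluster's ZooKeeper hosts.
        conf.set("hbase.zookeeper.quorum", "zk1.example.com");

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("demo_table"))) {
            // The Put is routed to the RegionServer hosting the region for "row1".
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
            table.put(put);
            // The RegionServer logs the write (WAL) and buffers it in its MemStore;
            // on flush, the data becomes an HFile whose blocks live on the DataNodes.
        }
    }
}
```

So the StoreFiles/HFiles the question asks about are real physical files, but they are HDFS files: the RegionServer writes and reads them, while the DataNodes store their blocks.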

For more information about HBase architecture, please check this excellent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html

BTW, as long as you have a PC with decent RAM, you can set up a demo cluster with virtual machines. Don't ever try to set up a production environment without properly testing the platform first in a development environment.

answered Oct 18 '22 by Rubén Moraleda


To go into more detail about this answer:

  1. "RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance."

I'm not sure how anyone would interpret the term "alongside", so let's try to be even more precise:

  1. What makes any physical server an "XYZ" server is that it's running a program called a daemon (think "eternally-running background event-handling" program);
  2. What makes a "file" server is that it's running a file-serving daemon;
  3. What makes a "web" server is that it's running a web-serving daemon;
  4. What makes a "data node" server is that it's running the HDFS data-serving daemon; and
  5. What makes a "region" server, then, is that it's running the HBase region-serving daemon (program).

So, in all Hadoop distributions (e.g., Cloudera, MapR, Hortonworks, and others), the general best practice for HBase is that the RegionServer daemons are "co-located" with the DataNode daemons.

This means that each of the actual slave (DataNode) servers forming the HDFS cluster runs the HDFS data-serving daemon (program) and also runs the HBase region-serving daemon (program)!

This way we ensure locality: the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, so HBase region servers (data nodes that also run the HBase daemon) must do all their processing (putting/getting/scanning) on the data nodes containing the HFiles, which make up HRegions, which make up HTables, which make up HBase (the Hadoop DataBase).
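
As a sketch of how you might verify this region-to-host assignment yourself, the HBase 2.x Java client's RegionLocator can list every region of a table together with the RegionServer hosting it; comparing those hostnames with your DataNode hosts shows whether the daemons are co-located (the table name `demo_table` is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;

public class RegionLocalityCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator =
                     connection.getRegionLocator(TableName.valueOf("demo_table"))) {
            // Print each region of the table and the RegionServer currently hosting it.
            for (HRegionLocation location : locator.getAllRegionLocations()) {
                System.out.printf("region=%s hostedBy=%s%n",
                        location.getRegion().getRegionNameAsString(),
                        location.getHostname());
            }
        }
    }
}
```

If the hostnames printed match the machines running the HDFS DataNode daemon, reads and writes against those regions can stay node-local.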

So, servers (VMs or physical machines running Windows, Linux, etc.) can run multiple daemons concurrently; in fact, they regularly run dozens of them.
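
In practice you would usually confirm this on a node with the JDK's jps tool, but as a self-contained illustration, this Java 9+ sketch lists any running processes whose command lines mention the HDFS or HBase daemon classes (command lines may be unreadable for processes owned by other users):

```java
public class DaemonCheck {
    public static void main(String[] args) {
        // Scan all visible processes for the HDFS DataNode and HBase RegionServer
        // daemons; on a co-located slave node, both should show up.
        ProcessHandle.allProcesses()
                .map(ProcessHandle::info)
                .flatMap(info -> info.commandLine().stream())
                .filter(cmd -> cmd.contains("DataNode") || cmd.contains("HRegionServer"))
                .forEach(System.out::println);
    }
}
```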

answered Oct 18 '22 by Mark Vogt