Sorry that I don't have the resource to set up a cluster to test it, I'm just wondering to know:
Can I deploy hbase region server on a separated machine other than the hadoop data node machine? I guess the answer is yes, but I'm not sure.
Is it good or bad to deploy hbase region server and hadoop data node on different machines?
When putting some data into hbase, where is this data eventually stored in, data node or region server? I guess it's data node, but what is the StoreFile and HFile in region server, isn't it the physical file to store our data?
Thank you!
RegionServers should always run alongside DataNodes in distributed clusters if you want decent performance.
Very bad, that will work against the data locality principle (If you want to know a little more about data locality check this: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html)
Actual data will be stored in the HDFS (DataNode), RegionServers are responsible of serving and managing regions.
For more information about HBase architecture please check this excelent post from Lars' blog: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
BTW, as long as you have a PC with decent RAM you can set up a demo cluster with virtual machines. Do not ever try to set up a production environment without properly test the platform first in a development environment.
To go in more detail about this answer:
- RegionServers should always run alongside? DataNodes in distributed clusters if you want decent performance."
I'm not sure how anyone would interpet the term alongside, so let's try to be even more precise:
So, in all Hadoop Distributions (eg Cloudera, MAPR, Hortonworks, others), the general best practice is that for HBase, the "RegionServers" are "co-located" with the "DataNodeServers".
This means that the actual slave (datanode) servers which form the HDFS cluster are each running the HDFS data-serving daemon (program) and they're also running the HBase region-serving daemon (program) as well!
This way we ensure locality - the concurrent processing and storing of data on all the individual nodes in an HDFS cluster, with no "movement" of gigantic loads of big data from "storage" locations to "processing" locations. Locality is vital to the success of a Hadoop cluster, such that HBase region servers (data nodes running the HBase daemon as well) must also do all their processing (putting/getting/scanning) on each data node containing the HFiles which make up HRegions which make up HTables which make up HBases (Hadoop-dataBases) ... .
So, servers (VMs or physical on Windows, Linux, ..) can run multiple daemons concurrently, often, they run dozens of them regularly.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With