How can I pre split in hbase

Tags:

I am storing data in hbase having 5 region servers. I am using md5 hash of url as my row keys. Currently all the data is getting stored in one region server only. So I want to pre-split the regions so that data will go uniformly across all region server, so that data will go in each region server uniformly. I want to split data as first character of row key.As first character is from 0 to f(16 characters). Like data with rowkey starting from 0 to 3 will go in 1st region server, 3-6 on 2nd , 6-9 on 3rd, a-d on 4th, d-f on 5th. How can I do it ?

947

asked Jan 27 '15 08:01

Harsh Sharma

3 Answers

You can provide a SPLITS property when creating the table.

create 'tableName', 'cf1', {SPLITS => ['3','6','9','d']}

The 4 split points will generate 5 regions.

Please be noticed that HBase's DefaultLoadBalancer doesn't guarantee a 100% even distribution between regionservers, it could happen that a regionserver hosts multiple regions from the same table.

For more information about how it works take a look at this:

public List<RegionPlan> balanceCluster(Map<ServerName,List<HRegionInfo>> clusterState)
Generate a global load balancing plan according to the specified map of server information to the most loaded regions of each server. The load balancing invariant is that all servers are within 1 region of the average number of regions per server. If the average is an integer number, all servers will be balanced to the average. Otherwise, all servers will have either floor(average) or ceiling(average) regions. HBASE-3609 Modeled regionsToMove using Guava's MinMaxPriorityQueue so that we can fetch from both ends of the queue. At the beginning, we check whether there was empty region server just discovered by Master. If so, we alternately choose new / old regions from head / tail of regionsToMove, respectively. This alternation avoids clustering young regions on the newly discovered region server. Otherwise, we choose new regions from head of regionsToMove. Another improvement from HBASE-3609 is that we assign regions from regionsToMove to underloaded servers in round-robin fashion. Previously one underloaded server would be filled before we move onto the next underloaded server, leading to clustering of young regions. Finally, we randomly shuffle underloaded servers so that they receive offloaded regions relatively evenly across calls to balanceCluster(). The algorithm is currently implemented as such:

Determine the two valid numbers of regions each server should have, MIN=floor(average) and MAX=ceiling(average).

Iterate down the most loaded servers, shedding regions from each so each server hosts exactly MAX regions. Stop once you reach a server that already has <= MAX regions. Order the regions to move from most recent to least.

Iterate down the least loaded servers, assigning regions so each server has exactly MIN regions. Stop once you reach a server that already has >= MIN regions. Regions being assigned to underloaded servers are those that were shed in the previous step. It is possible that there were not enough regions shed to fill each underloaded server to MIN. If so we end up with a number of regions required to do so, neededRegions. It is also possible that we were able to fill each underloaded but ended up with regions that were unassigned from overloaded servers but that still do not have assignment. If neither of these conditions hold (no regions needed to fill the underloaded servers, no regions leftover from overloaded servers), we are done and return. Otherwise we handle these cases below.

If neededRegions is non-zero (still have underloaded servers), we iterate the most loaded servers again, shedding a single server from each (this brings them from having MAX regions to having MIN regions).

We now definitely have more regions that need assignment, either from the previous step or from the original shedding from overloaded servers. Iterate the least loaded servers filling each to MIN. If we still have more regions that need assignment, again iterate the least loaded servers, this time giving each one (filling them to MAX) until we run out.

All servers will now either host MIN or MAX regions. In addition, any server hosting >= MAX regions is guaranteed to end up with MAX regions at the end of the balancing. This ensures the minimal number of regions possible are moved.

TODO: We can at-most reassign the number of regions away from a particular server to be how many they report as most loaded. Should we just keep all assignment in memory? Any objections? Does this mean we need HeapSize on HMaster? Or just careful monitor? (current thinking is we will hold all assignments in memory)

answered Nov 01 '22 18:11

Rubén Moraleda

In case you are using Apache Phoenix for creating tables in HBase, you can specify SALT_BUCKETS in the CREATE statement. The table will split into as many regions as the bucket mentioned. Phoenix calculates the Hash of rowkey (most probably a numeric hash % SALT_BUCKETS) and assigns the column cell to the appropriate region.

CREATE TABLE IF NOT EXISTS us_population (
      state CHAR(2) NOT NULL,
      city VARCHAR NOT NULL,
      population BIGINT
      CONSTRAINT my_pk PRIMARY KEY (state, city)) SALT_BUCKETS=3;

This will pre-split the table into 3 regions

Alternatively, the HBase default UI, allows you to split regions accordingly. enter image description here

answered Nov 01 '22 17:11

Pratyay

If you have all the data have already been stored, I recommend you just move some regions to another region servers manually using hbase shell.

hbase> move ‘ENCODED_REGIONNAME’, ‘SERVER_NAME’

Move a region. Optionally specify target regionserver else we choose one at random. NOTE: You pass the encoded region name, not the region name so this command is a little different to the others. The encoded region name is the hash suffix on region names: e.g. if the region name were TestTable,0094429456,1289497600452.527db22f95c8a9e0116f0cc13c680396. then the encoded region name portion is 527db22f95c8a9e0116f0cc13c680396 A server name is its host, port plus startcode. For example: host187.example.com,60020,1289493121758

answered Nov 01 '22 17:11

Andrey Kudryavtsev

Related questions
                            
                                overwrite hive partitions using spark
                            
                                Global variables in hadoop
                            
                                A way to export the results from Pig to a database
                            
                                Find the average of numbers using MapReduce
                            
                                How to use Hadoop InputFormats In Apache Spark?
                            
                                Hadoop MapReduce: Clarification on number of reducers
                            
                                What is the difference between hadoop job -kill job_id and yarn application -kill application_id
                            
                                localhost: ERROR: Cannot set priority of datanode process 32156
                            
                                Hadoop on Kubernetes vs Standard Hadoop
                            
                                java.io.IOException: Incompatible clusterIDs
                            
                                how to order my tuple of spark results descending order using value
                            
                                Setting YARN queue in PySpark
                            
                                CAP with distributed System
                            
                                How to copy first few lines of a large file in hadoop to a new file?
                            
                                Could you give me any clue Why 'Cannot call methods on a stopped SparkContext'?
                            
                                How to find Hadoop hdfs directory on my system?
                            
                                Running jobs parallely in hadoop
                            
                                How to import org.apache Java dependencies w/ or w/o Maven
                            
                                dfs.namenode.servicerpc-address or dfs.namenode.rpc-address is not configured
                            
                                Is it possible to import data into Hive table without copying the data

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How can I pre split in hbase

Tags:

hadoop

hbase