 

HBase shell scan bytes to string conversion

Tags:

hbase

jruby

I would like to scan an HBase table and see integers as strings (not their binary representation). I can do the conversion, but I have no idea how to write the scan statement using the Java API from the HBase shell:

org.apache.hadoop.hbase.util.Bytes.toString(
  "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65".to_java_bytes)

 org.apache.hadoop.hbase.util.Bytes.toString("Hello HBase".to_java_bytes)

I would be very happy to see examples of scan and get that search binary data (longs) and output normal strings. I am using the HBase shell, not Java.

Yuri Levinsky asked Jul 18 '13 14:07



3 Answers

HBase stores data as untyped byte arrays. If you perform a table scan, the data is therefore displayed in a common format (an escaped hexadecimal string), e.g.:
"\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65" -> Hello HBase
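You can check this mapping outside the shell with plain Ruby (no HBase needed); a minimal sketch that decodes the escaped hex form byte by byte:

```ruby
# Decode an HBase-shell-style escaped string back to readable text.
# Each \xNN escape is one byte; this assumes the input contains only
# \xNN escapes (printable pass-through bytes are not handled here).
escaped = '\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65'
decoded = escaped.scan(/\\x([0-9A-Fa-f]{2})/)
                 .map { |hex| hex[0].to_i(16).chr }
                 .join
puts decoded  # => Hello HBase
```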

If you want to get back the typed value from the serialized byte array, you have to do this manually. You have the following options:

  • Java code (Bytes.toString(...))
  • hack the to_string function in $HBASE_HOME/lib/ruby/hbase/table.rb: replace toStringBinary with toInt for non-meta tables
  • write a get/scan JRuby function which converts the byte array to the appropriate type
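For intuition about what Bytes.toInt does: HBase serializes an int as 4 big-endian bytes, and the conversion can be sketched in plain Ruby with String#unpack (the sample bytes below are made up for illustration; 'N' reads an unsigned 32-bit big-endian value, so this sketch only matches Bytes.toInt for non-negative ints):

```ruby
# The integer 42 as HBase would store it: 4 big-endian bytes.
bytes = "\x00\x00\x00\x2A".b
# 'N' = 32-bit big-endian unsigned; equivalent to Bytes.toInt for values >= 0.
value = bytes.unpack1('N')
puts value  # => 42
```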

Since you want to stay in the HBase shell, consider the last option. Create a file get_result.rb:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.ResultScanner
import org.apache.hadoop.hbase.client.Result
import java.util.ArrayList

# Simple function equivalent to scan 'test', {COLUMNS => 'c:c2'}
def get_result()
  htable = HTable.new(HBaseConfiguration.new, "test")
  rs = htable.getScanner(Bytes.toBytes("c"), Bytes.toBytes("c2"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
  rs.each { |r| 
    r.raw.each { |kv|
      row = Bytes.toString(kv.getRow)
      fam = Bytes.toString(kv.getFamily)
      ql = Bytes.toString(kv.getQualifier)
      ts = kv.getTimestamp
      val = Bytes.toInt(kv.getValue)
      output.add " #{row} \t\t\t\t\t\t column=#{fam}:#{ql}, timestamp=#{ts}, value=#{val}"
    }
  }
  output.each {|line| puts "#{line}\n"}
end

Load it in the HBase shell and use it:

require '/path/to/get_result'
get_result

Note: modify/enhance/fix the code according to your needs.

Lorand Bendig answered Nov 01 '22 04:11


Just for completeness' sake, it turns out that Bytes::toStringBinary gives the hex-escaped sequence you see in the HBase shell:

\x0B\x2_SOME_ASCII_TEXT_\x10\x00...

Whereas Bytes::toString will try to deserialize to a string assuming UTF-8, which will look more like:

\u8900\u0710\u0115\u0320\u0000_SOME_UTF8_TEXT_\u4009...
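The toStringBinary rule can be approximated in plain Ruby: printable ASCII bytes pass through, everything else becomes a \xNN escape. This is a rough sketch for intuition, not a faithful port of the HBase implementation (the real one has extra cases, e.g. for backslashes):

```ruby
# Rough approximation of Bytes::toStringBinary:
# printable ASCII (0x20..0x7E) passes through, other bytes become \xNN.
def to_string_binary(bytes)
  bytes.each_byte.map do |b|
    (32..126).cover?(b) ? b.chr : format('\x%02X', b)
  end.join
end

puts to_string_binary("\x0BHi\x10\x00".b)  # => \x0BHi\x10\x00
```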

kdawg answered Nov 01 '22 04:11


You can add a scan_counter command to the HBase shell.

first:

add the following to /usr/lib/hbase/lib/ruby/hbase/table.rb (after the scan function):

#----------------------------------------------------------------------------------------------
  # Scans the whole table or a range of keys and returns matching rows with values rendered as numbers
  def scan_counter(args = {})
    unless args.kind_of?(Hash)
      raise ArgumentError, "Arguments should be a hash. Failed to parse #{args.inspect}, #{args.class}"
    end

    limit = args.delete("LIMIT") || -1
    maxlength = args.delete("MAXLENGTH") || -1

    if args.any?
      filter = args["FILTER"]
      startrow = args["STARTROW"] || ''
      stoprow = args["STOPROW"]
      timestamp = args["TIMESTAMP"]
      columns = args["COLUMNS"] || args["COLUMN"] || get_all_columns
      # Use include? so an explicit CACHE_BLOCKS => false is not masked by the default
      cache = args.include?("CACHE_BLOCKS") ? args["CACHE_BLOCKS"] : true
      versions = args["VERSIONS"] || 1
      timerange = args["TIMERANGE"]

      # Normalize column names
      columns = [columns] if columns.class == String
      unless columns.kind_of?(Array)
        raise ArgumentError.new("COLUMNS must be specified as a String or an Array")
      end

      scan = if stoprow
        org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes, stoprow.to_java_bytes)
      else
        org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes)
      end

      columns.each { |c| scan.addColumns(c) }
      scan.setFilter(filter) if filter
      scan.setTimeStamp(timestamp) if timestamp
      scan.setCacheBlocks(cache)
      scan.setMaxVersions(versions) if versions > 1
      scan.setTimeRange(timerange[0], timerange[1]) if timerange
    else
      scan = org.apache.hadoop.hbase.client.Scan.new
    end

    # Start the scanner
    scanner = @table.getScanner(scan)
    count = 0
    res = {}
    iter = scanner.iterator

    # Iterate results
    while iter.hasNext
      if limit > 0 && count >= limit
        break
      end

      row = iter.next
      key = org.apache.hadoop.hbase.util.Bytes::toStringBinary(row.getRow)

      row.list.each do |kv|
        family = String.from_java_bytes(kv.getFamily)
        qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getQualifier)

        column = "#{family}:#{qualifier}"
        cell = to_string_scan_counter(column, kv, maxlength)

        if block_given?
          yield(key, "column=#{column}, #{cell}")
        else
          res[key] ||= {}
          res[key][column] = cell
        end
      end

      # One more row processed
      count += 1
    end

    return ((block_given?) ? count : res)
  end

  #----------------------------------------------------------------------------------------
  # Helper methods

  # Returns a list of column names in the table
  def get_all_columns
    @table.table_descriptor.getFamilies.map do |family|
      "#{family.getNameAsString}:"
    end
  end

  # Checks if current table is one of the 'meta' tables
  def is_meta_table?
    tn = @table.table_name
    org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::META_TABLE_NAME) || org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::ROOT_TABLE_NAME)
  end

  # Returns family and (when has it) qualifier for a column name
  def parse_column_name(column)
    split = org.apache.hadoop.hbase.KeyValue.parseColumn(column.to_java_bytes)
    return split[0], (split.length > 1) ? split[1] : nil
  end

  # Make a String of the passed kv
  # Intercept cells whose format we know such as the info:regioninfo in .META.
  def to_string(column, kv, maxlength = -1)
    if is_meta_table?
      if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
        hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
        return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
      end
      if column == 'info:serverstartcode'
        if kv.getValue.length > 0
          str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
        else
          str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
        end
        return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
      end
    end

    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
    (maxlength != -1) ? val[0, maxlength] : val
  end


  def to_string_scan_counter(column, kv, maxlength = -1)
    if is_meta_table?
      if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
        hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
        return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
      end
      if column == 'info:serverstartcode'
        if kv.getValue.length > 0
          str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
        else
          str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
        end
        return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
      end
    end

    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)}"
    (maxlength != -1) ? val[0, maxlength] : val
  end

second:

add the following file, named scan_counter.rb, to /usr/lib/hbase/lib/ruby/shell/commands/:

#
# Copyright 2010 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

module Shell
  module Commands
    class ScanCounter < Command
      def help
        return <<-EOF
Scan a table whose cell values are stored as longs; pass the table name and optionally a dictionary of scanner
specifications.  Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.

Some examples:

  hbase> scan_counter '.META.'
  hbase> scan_counter '.META.', {COLUMNS => 'info:regioninfo'}
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan_counter 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  hbase> scan_counter 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}

For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false).  By
default it is enabled.  Examples:

  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
EOF
      end

      def command(table, args = {})
        now = Time.now
        formatter.header(["ROW", "COLUMN+CELL"])

        count = table(table).scan_counter(args) do |row, cells|
          formatter.row([ row, cells ])
        end

        formatter.footer(now, count)
      end
    end
  end
end

finally

register scan_counter in /usr/lib/hbase/lib/ruby/shell.rb.

Replace the current 'dml' command group (you can identify it by 'DATA MANIPULATION COMMANDS') with this:

Shell.load_command_group(
  'dml',
  :full_name => 'DATA MANIPULATION COMMANDS',
  :commands => %w[
    count
    delete
    deleteall
    get
    get_counter
    incr
    put
    scan
    scan_counter
    truncate
  ]
)

Udy answered Nov 01 '22 05:11