 

HBase shell scan bytes to string conversion

Tags:

hbase

jruby

I would like to scan an HBase table and see integers as strings (not their binary representation). I can do the conversion, but I have no idea how to write the scan statement using the Java API from the HBase shell:

org.apache.hadoop.hbase.util.Bytes.toString(
  "\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65".to_java_bytes)

 org.apache.hadoop.hbase.util.Bytes.toString("Hello HBase".to_java_bytes)

I would be very happy to see examples of scan and get that search binary data (longs) and output normal strings. I am using the HBase shell, not Java.

Yuri Levinsky asked Jul 18 '13 14:07



3 Answers

HBase stores data as untyped byte arrays. If you perform a table scan, the data is therefore displayed in a common format (an escaped hexadecimal string), e.g.:
"\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65" -> Hello HBase
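You can check this mapping outside the shell with plain Ruby (no HBase needed); a minimal sketch that decodes the escaped hex form byte by byte:

```ruby
# Decode an HBase-shell-style escaped string back to readable text.
# Each \xNN escape is one byte; this assumes the input contains only
# \xNN escapes (printable pass-through bytes are not handled here).
escaped = '\x48\x65\x6c\x6c\x6f\x20\x48\x42\x61\x73\x65'
decoded = escaped.scan(/\\x([0-9A-Fa-f]{2})/)
                 .map { |hex| hex[0].to_i(16).chr }
                 .join
puts decoded  # => Hello HBase
```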

If you want to get back the typed value from the serialized byte array, you have to do this manually. You have the following options:

  • Java code (Bytes.toString(...))
  • hack the to_string function in $HBASE_HOME/lib/ruby/hbase/table.rb: replace toStringBinary with toInt for non-meta tables
  • write a get/scan JRuby function which converts the byte array to the appropriate type
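For intuition about what Bytes.toInt does: HBase serializes an int as 4 big-endian bytes, and the conversion can be sketched in plain Ruby with String#unpack (the sample bytes below are made up for illustration; 'N' reads an unsigned 32-bit big-endian value, so this sketch only matches Bytes.toInt for non-negative ints):

```ruby
# The integer 42 as HBase would store it: 4 big-endian bytes.
bytes = "\x00\x00\x00\x2A".b
# 'N' = 32-bit big-endian unsigned; equivalent to Bytes.toInt for values >= 0.
value = bytes.unpack1('N')
puts value  # => 42
```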

Since you want to stay in the HBase shell, consider the last option. Create a file get_result.rb:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Scan
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.client.ResultScanner
import org.apache.hadoop.hbase.client.Result
import java.util.ArrayList

# Simple function equivalent to scan 'test', {COLUMNS => 'c:c2'}
def get_result()
  htable = HTable.new(HBaseConfiguration.new, "test")
  rs = htable.getScanner(Bytes.toBytes("c"), Bytes.toBytes("c2"))
  output = ArrayList.new
  output.add "ROW\t\t\t\t\t\tCOLUMN+CELL"
  rs.each { |r| 
    r.raw.each { |kv|
      row = Bytes.toString(kv.getRow)
      fam = Bytes.toString(kv.getFamily)
      ql = Bytes.toString(kv.getQualifier)
      ts = kv.getTimestamp
      val = Bytes.toInt(kv.getValue)
      output.add " #{row} \t\t\t\t\t\t column=#{fam}:#{ql}, timestamp=#{ts}, value=#{val}"
    }
  }
  output.each {|line| puts "#{line}\n"}
end

Load it in the HBase shell and use it:

require '/path/to/get_result'
get_result

Note: modify/enhance/fix the code according to your needs.

Lorand Bendig answered Nov 01 '22 04:11


Just for completeness' sake, it turns out that Bytes::toStringBinary gives the hex-escaped sequence you see in the HBase shell:

\x0B\x2_SOME_ASCII_TEXT_\x10\x00...

Whereas Bytes::toString will try to deserialize to a string assuming UTF-8, which will look more like:

\u8900\u0710\u0115\u0320\u0000_SOME_UTF8_TEXT_\u4009...
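The toStringBinary rule can be approximated in plain Ruby: printable ASCII bytes pass through, everything else becomes a \xNN escape. This is a rough sketch for intuition, not a faithful port of the HBase implementation (the real one has extra cases, e.g. for backslashes):

```ruby
# Rough approximation of Bytes::toStringBinary:
# printable ASCII (0x20..0x7E) passes through, other bytes become \xNN.
def to_string_binary(bytes)
  bytes.each_byte.map do |b|
    (32..126).cover?(b) ? b.chr : format('\x%02X', b)
  end.join
end

puts to_string_binary("\x0BHi\x10\x00".b)  # => \x0BHi\x10\x00
```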

kdawg answered Nov 01 '22 04:11


You can add a scan_counter command to the HBase shell.

first:

add the following to /usr/lib/hbase/lib/ruby/hbase/table.rb (after the scan function):

#----------------------------------------------------------------------------------------------
  # Scans the whole table or a range of keys and returns matching rows with values rendered as numbers
  def scan_counter(args = {})
    unless args.kind_of?(Hash)
      raise ArgumentError, "Arguments should be a hash. Failed to parse #{args.inspect}, #{args.class}"
    end

    limit = args.delete("LIMIT") || -1
    maxlength = args.delete("MAXLENGTH") || -1

    if args.any?
      filter = args["FILTER"]
      startrow = args["STARTROW"] || ''
      stoprow = args["STOPROW"]
      timestamp = args["TIMESTAMP"]
      columns = args["COLUMNS"] || args["COLUMN"] || get_all_columns
      # Use include? so an explicit CACHE_BLOCKS => false is not masked by the default
      cache = args.include?("CACHE_BLOCKS") ? args["CACHE_BLOCKS"] : true
      versions = args["VERSIONS"] || 1
      timerange = args["TIMERANGE"]

      # Normalize column names
      columns = [columns] if columns.class == String
      unless columns.kind_of?(Array)
        raise ArgumentError.new("COLUMNS must be specified as a String or an Array")
      end

      scan = if stoprow
        org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes, stoprow.to_java_bytes)
      else
        org.apache.hadoop.hbase.client.Scan.new(startrow.to_java_bytes)
      end

      columns.each { |c| scan.addColumns(c) }
      scan.setFilter(filter) if filter
      scan.setTimeStamp(timestamp) if timestamp
      scan.setCacheBlocks(cache)
      scan.setMaxVersions(versions) if versions > 1
      scan.setTimeRange(timerange[0], timerange[1]) if timerange
    else
      scan = org.apache.hadoop.hbase.client.Scan.new
    end

    # Start the scanner
    scanner = @table.getScanner(scan)
    count = 0
    res = {}
    iter = scanner.iterator

    # Iterate results
    while iter.hasNext
      if limit > 0 && count >= limit
        break
      end

      row = iter.next
      key = org.apache.hadoop.hbase.util.Bytes::toStringBinary(row.getRow)

      row.list.each do |kv|
        family = String.from_java_bytes(kv.getFamily)
        qualifier = org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getQualifier)

        column = "#{family}:#{qualifier}"
        cell = to_string_scan_counter(column, kv, maxlength)

        if block_given?
          yield(key, "column=#{column}, #{cell}")
        else
          res[key] ||= {}
          res[key][column] = cell
        end
      end

      # One more row processed
      count += 1
    end

    return ((block_given?) ? count : res)
  end

  #----------------------------------------------------------------------------------------
  # Helper methods

  # Returns a list of column names in the table
  def get_all_columns
    @table.table_descriptor.getFamilies.map do |family|
      "#{family.getNameAsString}:"
    end
  end

  # Checks if current table is one of the 'meta' tables
  def is_meta_table?
    tn = @table.table_name
    org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::META_TABLE_NAME) || org.apache.hadoop.hbase.util.Bytes.equals(tn, org.apache.hadoop.hbase.HConstants::ROOT_TABLE_NAME)
  end

  # Returns family and (when has it) qualifier for a column name
  def parse_column_name(column)
    split = org.apache.hadoop.hbase.KeyValue.parseColumn(column.to_java_bytes)
    return split[0], (split.length > 1) ? split[1] : nil
  end

  # Make a String of the passed kv
  # Intercept cells whose format we know such as the info:regioninfo in .META.
  def to_string(column, kv, maxlength = -1)
    if is_meta_table?
      if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
        hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
        return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
      end
      if column == 'info:serverstartcode'
        if kv.getValue.length > 0
          str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
        else
          str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
        end
        return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
      end
    end

    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toStringBinary(kv.getValue)}"
    (maxlength != -1) ? val[0, maxlength] : val
  end


  def to_string_scan_counter(column, kv, maxlength = -1)
    if is_meta_table?
      if column == 'info:regioninfo' or column == 'info:splitA' or column == 'info:splitB'
        hri = org.apache.hadoop.hbase.util.Writables.getHRegionInfoOrNull(kv.getValue)
        return "timestamp=%d, value=%s" % [kv.getTimestamp, hri.toString]
      end
      if column == 'info:serverstartcode'
        if kv.getValue.length > 0
          str_val = org.apache.hadoop.hbase.util.Bytes.toLong(kv.getValue)
        else
          str_val = org.apache.hadoop.hbase.util.Bytes.toStringBinary(kv.getValue)
        end
        return "timestamp=%d, value=%s" % [kv.getTimestamp, str_val]
      end
    end

    val = "timestamp=#{kv.getTimestamp}, value=#{org.apache.hadoop.hbase.util.Bytes::toLong(kv.getValue)}"
    (maxlength != -1) ? val[0, maxlength] : val
  end

second:

add the following file, named scan_counter.rb, to /usr/lib/hbase/lib/ruby/shell/commands/:

#
# Copyright 2010 The Apache Software Foundation
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

module Shell
  module Commands
    class ScanCounter < Command
      def help
        return <<-EOF
Scan a table whose cell values are stored as longs; pass the table name and optionally a dictionary of scanner
specifications.  Scanner specifications may include one or more of:
TIMERANGE, FILTER, LIMIT, STARTROW, STOPROW, TIMESTAMP, MAXLENGTH,
or COLUMNS. If no columns are specified, all columns will be scanned.
To scan all members of a column family, leave the qualifier empty as in
'col_family:'.

Some examples:

  hbase> scan_counter '.META.'
  hbase> scan_counter '.META.', {COLUMNS => 'info:regioninfo'}
  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], LIMIT => 10, STARTROW => 'xyz'}
  hbase> scan_counter 't1', {FILTER => org.apache.hadoop.hbase.filter.ColumnPaginationFilter.new(1, 0)}
  hbase> scan_counter 't1', {COLUMNS => 'c1', TIMERANGE => [1303668804, 1303668904]}

For experts, there is an additional option -- CACHE_BLOCKS -- which
switches block caching for the scanner on (true) or off (false).  By
default it is enabled.  Examples:

  hbase> scan_counter 't1', {COLUMNS => ['c1', 'c2'], CACHE_BLOCKS => false}
EOF
      end

      def command(table, args = {})
        now = Time.now
        formatter.header(["ROW", "COLUMN+CELL"])

        count = table(table).scan_counter(args) do |row, cells|
          formatter.row([ row, cells ])
        end

        formatter.footer(now, count)
      end
    end
  end
end

finally

register scan_counter in /usr/lib/hbase/lib/ruby/shell.rb.

Replace the current 'dml' command group (you can identify it by 'DATA MANIPULATION COMMANDS') with this:

Shell.load_command_group(
  'dml',
  :full_name => 'DATA MANIPULATION COMMANDS',
  :commands => %w[
    count
    delete
    deleteall
    get
    get_counter
    incr
    put
    scan
    scan_counter
    truncate
  ]
)

Udy answered Nov 01 '22 05:11