Can we get all the column names from an HBase table?

Setup:

I have an HBase table with 100M+ rows and 1 million+ columns. Each row has data for only 2 to 5 columns. There is just 1 column family.

Problem:

I want to find out all the distinct qualifiers (columns) in this column family. Is there a quick way to do that?

I can think of scanning the whole table, getting the familyMap for each row, reading each qualifier, and adding it to a Set<>. But that would be awfully slow, as there are 100M+ rows.

Can we do any better?

asked Oct 19 '15 23:10

Bhushan

2 Answers

You can use a MapReduce job for this. Unlike a coprocessor, it does not require installing custom libraries on the HBase cluster. Below is the code for creating the MapReduce job.

Job setup

    Job job = Job.getInstance(config);
    job.setJobName("Distinct columns");

    Scan scan = new Scan();
    scan.setBatch(500);                       // return at most 500 cells per Result
    scan.addFamily(YOUR_COLUMN_FAMILY_NAME);  // byte[] name of the single column family
    scan.setFilter(new KeyOnlyFilter());      // fetch only the key part of each KeyValue (row, column family, qualifier), not the value
    scan.setCacheBlocks(false);               // don't set to true for MR jobs

    TableMapReduceUtil.initTableMapperJob(
            YOUR_TABLE_NAME,
            scan,
            OnlyColumnNameMapper.class,   // mapper
            Text.class,                   // mapper output key
            Text.class,                   // mapper output value
            job);

    job.setNumReduceTasks(1);             // one reducer, so all qualifiers end up in one output file
    job.setReducerClass(OnlyColumnNameReducer.class);
    // Before submitting, the job still needs an output location, e.g.
    // FileOutputFormat.setOutputPath(job, ...), and a call to job.waitForCompletion(true).

Mapper

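import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class OnlyColumnNameMapper extends TableMapper<Text, Text> {

    @Override
    protected void map(ImmutableBytesWritable key, Result value, final Context context)
            throws IOException, InterruptedException {
        CellScanner cellScanner = value.cellScanner();
        while (cellScanner.advance()) {
            Cell cell = cellScanner.current();
            // Copy only the qualifier bytes out of the cell's backing array.
            byte[] q = Bytes.copy(cell.getQualifierArray(),
                                  cell.getQualifierOffset(),
                                  cell.getQualifierLength());
            // Emit the qualifier as the key; the value is an empty placeholder.
            context.write(new Text(q), new Text());
        }
    }
}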

Reducer

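import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class OnlyColumnNameReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Each distinct qualifier reaches reduce() exactly once as a key, so write it once.
        context.write(new Text(key), new Text());
    }
}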
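
The job above emits one record per cell, so with 100M+ rows of 2 to 5 cells each, the single reducer receives hundreds of millions of records for roughly a million distinct keys. One possible mitigation, which is not part of the original answer, is to collapse duplicates on the map side first; in Hadoop this is usually done with a combiner (`Job.setCombinerClass`). The plain-Java sketch below only illustrates the idea with in-memory lists standing in for input splits. The class and method names are hypothetical, and `String` stands in for the qualifier bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class TwoPhaseDistinctSketch {

    // "Map/combine" side: collapse duplicates inside one split before anything is shuffled.
    static Set<String> distinctInSplit(List<String> qualifiersInSplit) {
        return new HashSet<>(qualifiersInSplit);
    }

    // "Reduce" side: merge the per-split sets; every qualifier survives exactly once.
    static TreeSet<String> mergeSplits(List<Set<String>> perSplit) {
        TreeSet<String> all = new TreeSet<>();
        for (Set<String> split : perSplit) {
            all.addAll(split);
        }
        return all;
    }

    public static void main(String[] args) {
        // Two hypothetical splits, each listing the qualifier of every cell it scanned.
        List<List<String>> splits = Arrays.asList(
                Arrays.asList("colA", "colB", "colA", "colA"),
                Arrays.asList("colB", "colC", "colC"));

        int rawCells = 0;
        int shuffled = 0;
        List<Set<String>> partial = new ArrayList<>();
        for (List<String> split : splits) {
            Set<String> distinct = distinctInSplit(split);
            rawCells += split.size();
            shuffled += distinct.size();
            partial.add(distinct);
        }

        System.out.println(rawCells);             // 7
        System.out.println(shuffled);             // 4
        System.out.println(mergeSplits(partial)); // [colA, colB, colC]
    }
}
```

The final result is the same as without the local step; only the number of records that cross the shuffle changes.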

answered Sep 30 '22 00:09

Alexander Kuznetsov

HBase can be visualised as a distributed NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>>, that is: row key -> column family -> qualifier -> timestamp -> value.

There is no "metadata" (say, something stored centrally on the master node) that lists all the qualifiers present across the region servers.

So if this is a one-time use case, the only way is to scan the entire table and add the qualifier names to a Set<>, as you mentioned.
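
To make the nested-map picture and the full-scan approach concrete, here is a minimal in-memory sketch. It is not HBase client code: the `Table` class and its methods are hypothetical, and `String` replaces `byte[]` for keys and values to keep it short.

```java
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class QualifierScanSketch {

    // row key -> column family -> qualifier -> timestamp -> value
    static final class Table {
        final NavigableMap<String,
                NavigableMap<String,
                        NavigableMap<String,
                                NavigableMap<Long, String>>>> rows = new TreeMap<>();

        void put(String row, String family, String qualifier, long timestamp, String value) {
            rows.computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(family, f -> new TreeMap<>())
                .computeIfAbsent(qualifier, q -> new TreeMap<>())
                .put(timestamp, value);
        }

        // Full scan: visit every row and collect the qualifier keys of one family.
        // Nothing above the row level knows which qualifiers exist, so no shortcut is possible.
        NavigableSet<String> distinctQualifiers(String family) {
            NavigableSet<String> qualifiers = new TreeSet<>();
            for (NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families
                    : rows.values()) {
                NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
                if (columns != null) {
                    qualifiers.addAll(columns.keySet());
                }
            }
            return qualifiers;
        }
    }

    public static void main(String[] args) {
        Table table = new Table();
        table.put("row1", "cf", "colA", 1L, "v1");
        table.put("row1", "cf", "colB", 1L, "v2");
        table.put("row2", "cf", "colB", 2L, "v3");
        table.put("row3", "cf", "colC", 3L, "v4");
        System.out.println(table.distinctQualifiers("cf")); // prints [colA, colB, colC]
    }
}
```

The qualifier set lives only as keys inside each row's map, which is why collecting it costs a pass over every row.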

If this is a recurring use case (and you have the discretion to add components to your tech stack), you may want to consider adding Redis. The set of qualifiers can then be maintained in a distributed fashion using a Redis Set.
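
A rough sketch of that write-time bookkeeping, under stated assumptions: the class and method names are hypothetical, and an in-memory concurrent set stands in for the Redis Set so the example runs without a Redis server. With a real Redis client, `recordWrite` would issue an `SADD` and `snapshot` an `SMEMBERS`.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

public class QualifierRegistrySketch {

    // Stand-in for a Redis Set. Adding an existing member is a no-op, as with SADD.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    // Hypothetical hook: call this from the code path that writes a cell to HBase.
    // Returns true only the first time a qualifier is recorded.
    boolean recordWrite(String qualifier) {
        return seen.add(qualifier);
    }

    // Sorted copy of everything recorded so far; no table scan involved.
    TreeSet<String> snapshot() {
        return new TreeSet<>(seen);
    }

    public static void main(String[] args) {
        QualifierRegistrySketch registry = new QualifierRegistrySketch();
        registry.recordWrite("colB");
        registry.recordWrite("colA");
        registry.recordWrite("colB"); // duplicate, ignored
        System.out.println(registry.snapshot()); // prints [colA, colB]
    }
}
```

This only covers qualifiers written after the tracking is in place; qualifiers already in the table would still need one initial scan to seed the set.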

answered Sep 30 '22 00:09

Manu Manjunath

Tags:

hadoop

hbase