I have an HBase table with 100M+ rows and 1 million+ columns. Every row has data for only 2 to 5 columns. There is just 1 column family.
I want to find all the distinct qualifiers (columns) in this column family. Is there a quick way to do that?
I can think of scanning the whole table, getting the familyMap for each row, and adding each qualifier to a Set<>. But that would be awfully slow, as there are 100M+ rows.
Can we do any better?
You can retrieve data from an HBase table using the get() method of the HTable class. This method reads cells from a given row and takes a Get object as its parameter.
An HBase table is made of column families which are the logical and physical grouping of columns. The columns in one family are stored separately from the columns in another family.
HBase is a column-oriented database, usually categorized as a NoSQL database. HBase is built on top of Hadoop and shares many concepts with Google's Bigtable, mainly its data model. In HBase, data is stored in tables, each table being composed of rows and column families.
You can use a MapReduce job for this. In this case you don't need to install custom libraries on HBase, as you would for a coprocessor. Below is the code for creating the MapReduce job.
Job setup
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

Job job = Job.getInstance(config);
job.setJobName("Distinct columns");
Scan scan = new Scan();
scan.setBatch(500);
scan.addFamily(YOU_COLUMN_FAMILY_NAME);
scan.setFilter(new KeyOnlyFilter()); // scan only the key part of each KeyValue (row, column family, column)
scan.setCacheBlocks(false); // don't set to true for MR jobs
TableMapReduceUtil.initTableMapperJob(
        YOU_TABLE_NAME,
        scan,
        OnlyColumnNameMapper.class, // mapper
        Text.class,                 // mapper output key
        Text.class,                 // mapper output value
        job);
job.setNumReduceTasks(1);
job.setReducerClass(OnlyColumnNameReducer.class);
// Still needed before running: configure the job output and submit the job.
Mapper
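import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellScanner;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

public class OnlyColumnNameMapper extends TableMapper<Text, Text> {
    @Override
    protected void map(ImmutableBytesWritable key, Result value, final Context context) throws IOException, InterruptedException {
        CellScanner cellScanner = value.cellScanner();
        while (cellScanner.advance()) {
            Cell cell = cellScanner.current();
            // Copy only the qualifier bytes out of the cell's backing array.
            byte[] q = Bytes.copy(cell.getQualifierArray(),
                    cell.getQualifierOffset(),
                    cell.getQualifierLength());
            // Emit the qualifier as the key; the value is left empty.
            context.write(new Text(q), new Text());
        }
    }
}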
Reducer
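import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class OnlyColumnNameReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        // reduce() is called once per distinct key, so each qualifier is written exactly once.
        context.write(new Text(key), new Text());
    }
}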
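To see why this job produces distinct qualifiers, here is a rough plain-Java sketch of the same flow on a toy in-memory table. It uses no Hadoop or HBase classes; the class name, method names and sample rows are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Plain-Java stand-in for the job above (hypothetical names, toy data).
class DistinctQualifierFlowSketch {

    // "Map" phase: emit one (qualifier, "") pair per cell, like OnlyColumnNameMapper.
    static List<String[]> mapPhase(Map<String, List<String>> rows) {
        List<String[]> emitted = new ArrayList<>();
        for (List<String> qualifiers : rows.values()) {
            for (String q : qualifiers) {
                emitted.add(new String[] {q, ""});
            }
        }
        return emitted;
    }

    // "Shuffle" groups the pairs by key; "reduce" then writes each key once.
    static List<String> shuffleAndReduce(List<String[]> emitted) {
        SortedMap<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : emitted) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        // Toy table: row key -> qualifiers present in that row.
        Map<String, List<String>> rows = new TreeMap<>();
        rows.put("row1", Arrays.asList("a", "b"));
        rows.put("row2", Arrays.asList("b", "c"));
        rows.put("row3", Arrays.asList("a"));
        System.out.println(shuffleAndReduce(mapPhase(rows))); // prints [a, b, c]
    }
}
```

The deduplication comes entirely from the grouping by key, which is why the reducer can ignore its values.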
HBase can be visualised as a distributed NavigableMap<byte[], NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>>>
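As a rough illustration of that nested-map view, here is a small in-memory sketch. It uses String instead of byte[] for readability, and the class and method names are hypothetical. Qualifiers live inside each row's own map, so collecting the distinct ones means visiting every row.

```java
import java.util.NavigableMap;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only: row -> family -> qualifier -> timestamp -> value.
class NestedMapModelSketch {
    final NavigableMap<String, NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>>> rows =
            new TreeMap<>();

    // Store one cell, creating the intermediate maps on demand.
    void put(String row, String family, String qualifier, long timestamp, String value) {
        rows.computeIfAbsent(row, r -> new TreeMap<>())
            .computeIfAbsent(family, f -> new TreeMap<>())
            .computeIfAbsent(qualifier, q -> new TreeMap<>())
            .put(timestamp, value);
    }

    // There is no separate index of qualifiers, so every row has to be walked.
    NavigableSet<String> distinctQualifiers(String family) {
        NavigableSet<String> result = new TreeSet<>();
        for (NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> families : rows.values()) {
            NavigableMap<String, NavigableMap<Long, String>> columns = families.get(family);
            if (columns != null) {
                result.addAll(columns.keySet());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        NestedMapModelSketch table = new NestedMapModelSketch();
        table.put("row1", "cf", "a", 1L, "v1");
        table.put("row1", "cf", "b", 1L, "v2");
        table.put("row2", "cf", "b", 2L, "v3");
        table.put("row2", "cf", "c", 2L, "v4");
        System.out.println(table.distinctQualifiers("cf")); // prints [a, b, c]
    }
}
```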
There is no central "metadata" (say, something stored on the master node) that lists all the qualifiers present across the region servers.
So if you have a one-time use case, the only way is to scan through the entire table and add the qualifier names to a Set<>, as you mentioned.
If this is a recurring use case (and you have the discretion to add components to your tech stack), you may want to consider adding Redis. The set of qualifiers can be maintained in a distributed fashion using a Redis Set.
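A minimal sketch of that idea, assuming every write to the table goes through one code path that also records the qualifier. To keep the example self-contained, an in-process concurrent set stands in for Redis here. With a real Redis Set the two operations would correspond to the SADD and SMEMBERS commands. The class and method names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.ConcurrentHashMap;

// In-process stand-in for a Redis Set of qualifier names (illustration only).
class QualifierRegistrySketch {
    private final Set<String> qualifiers = ConcurrentHashMap.newKeySet();

    // Call alongside every HBase write (role of SADD).
    // Returns true if this qualifier had not been seen before.
    boolean recordWrite(String qualifier) {
        return qualifiers.add(qualifier);
    }

    // Sorted copy of all qualifiers recorded so far (role of SMEMBERS).
    Set<String> snapshot() {
        return new TreeSet<>(qualifiers);
    }

    public static void main(String[] args) {
        QualifierRegistrySketch registry = new QualifierRegistrySketch();
        registry.recordWrite("a");
        registry.recordWrite("b");
        registry.recordWrite("a"); // duplicate, ignored by the set
        System.out.println(registry.snapshot()); // prints [a, b]
    }
}
```

Note that a set like this only knows about writes made after it was introduced, so the existing rows would still need one full scan to backfill it.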