Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hbase quickly count number of rows

Right now I implement row count over ResultScanner like this

for (Result rs = scanner.next(); rs != null; rs = scanner.next()) {
    number++;
}

If data reaching millions time computing is large.I want to compute in real time that i don't want to use Mapreduce

How to quickly count number of rows.

like image 773
cldo Avatar asked Jul 07 '12 12:07

cldo


People also ask

What is the use of get () method in HBase?

You can retrieve data from the HBase table using the get() method of the HTable class. This method extracts a cell from a given row. It requires a Get class object as parameter.

How does TTL work in HBase?

You can set ColumnFamilies a TTL length in seconds, and HBase will automatically delete rows or automatically expires the row once the expiration time is reached. This setting applies to all versions of a row in that table– even the current one. The TTL time encoded in the HBase for the row is specified in UTC.

Does HBase have tables?

An HBase table is a multi-dimensional map comprised of one or more columns and rows of data. You specify the complete set of column families when you create an HBase table. An HBase cell is comprised of a row (column family, column qualifier, column value) and a timestamp.


11 Answers

Use RowCounter in HBase RowCounter is a mapreduce job to count all the rows of a table. This is a good utility to use as a sanity check to ensure that HBase can read all the blocks of a table if there are any concerns of metadata inconsistency. It will run the mapreduce all in a single process but it will run faster if you have a MapReduce cluster in place for it to exploit.

$ hbase org.apache.hadoop.hbase.mapreduce.RowCounter <tablename>

Usage: RowCounter [options] 
    <tablename> [          
        --starttime=[start] 
        --endtime=[end] 
        [--range=[startKey],[endKey]] 
        [<column1> <column2>...]
    ]
like image 57
Basil Saju Avatar answered Sep 22 '22 09:09

Basil Saju


You can use the count method in hbase to count the number of rows. But yes, counting rows of a large table can be slow.count 'tablename' [interval]

Return value is the number of rows.

This operation may take a LONG time (Run ‘$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount’ to run a counting mapreduce job). Current count is shown every 1000 rows by default. Count interval may be optionally specified. Scan caching is enabled on count scans by default. Default cache size is 10 rows. If your rows are small in size, you may want to increase this parameter.

Examples:

hbase> count 't1'

hbase> count 't1', INTERVAL => 100000

hbase> count 't1', CACHE => 1000

hbase> count 't1', INTERVAL => 10, CACHE => 1000

The same commands also can be run on a table reference. Suppose you had a reference to table 't1', the corresponding commands would be:

hbase> t.count

hbase> t.count INTERVAL => 100000

hbase> t.count CACHE => 1000

hbase> t.count INTERVAL => 10, CACHE => 1000
like image 21
Neethu Lalitha Avatar answered Sep 22 '22 09:09

Neethu Lalitha


If you cannot use RowCounter for whatever reason, then a combination of these two filters should be an optimal way to get a count:

FirstKeyOnlyFilter() AND KeyOnlyFilter()

The FirstKeyOnlyFilter will result in the scanner only returning the first column qualifier it finds, as opposed to the scanner returning all of the column qualifiers in the table, which will minimize the network bandwith. What about simply picking one column qualifier to return? This would work if you could guarentee that column qualifier exists for every row, but if that is not true then you would get an inaccurate count.

The KeyOnlyFilter will result in the scanner only returning the column family, and will not return any value for the column qualifier. This further reduces the network bandwidth, which in the general case wouldn't account for much of a reduction, but there can be an edge case where the first column picked by the previous filter just happens to be an extremely large value.

I tried playing around with scan.setCaching but the results were all over the place. Perhaps it could help.

I had 16 million rows in between a start and stop that I did the following pseudo-empirical testing:

With FirstKeyOnlyFilter and KeyOnlyFilter activated:

    With caching not set (i.e., the default value), it took 188 seconds.
    With caching set to 1, it took 188 seconds
    With caching set to 10, it took 200 seconds
    With caching set to 100, it took 187 seconds
    With caching set to 1000, it took 183 seconds.
    With caching set to 10000, it took 199 seconds.
    With caching set to 100000, it took 199 seconds.

With FirstKeyOnlyFilter and KeyOnlyFilter disabled:

    With caching not set, (i.e., the default value), it took 309 seconds

I didn't bother to do proper testing on this, but it seems clear that the FirstKeyOnlyFilter and KeyOnlyFilter are good.

Moreover, the cells in this particular table are very small - so I think the filters would have been even better on a different table.


Here is a Java code sample:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

import org.apache.hadoop.hbase.filter.RowFilter;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter; 
import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter; 
import org.apache.hadoop.hbase.filter.FilterList;

import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.RegexStringComparator; 

public class HBaseCount {
    public static void main(String[] args) throws IOException {
        Configuration config = HBaseConfiguration.create();

        HTable table = new HTable(config, "my_table");

        Scan scan = new Scan(
            Bytes.toBytes("foo"), Bytes.toBytes("foo~")
        );

        if (args.length == 1) {
            scan.setCaching(Integer.valueOf(args[0]));
        }
        System.out.println("scan's caching is " + scan.getCaching());

        FilterList allFilters = new FilterList();
        allFilters.addFilter(new FirstKeyOnlyFilter());
        allFilters.addFilter(new KeyOnlyFilter());

        scan.setFilter(allFilters);

        ResultScanner scanner = table.getScanner(scan);

        int count = 0;

        long start = System.currentTimeMillis();

        try {
            for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
                count += 1;
                if (count % 100000 == 0) System.out.println(count);
            }
        } finally {
            scanner.close();
        }

        long end = System.currentTimeMillis();

        long elapsedTime = end - start;

        System.out.println("Elapsed time was " + (elapsedTime/1000F));

    }
}


Here is a pychbase code sample:

    from pychbase import Connection
    c = Connection()
    t = c.table('my_table')
    # Under the hood this applies the FirstKeyOnlyFilter and KeyOnlyFilter
    # similar to the happybase example below
    print t.count(row_prefix="foo")

Here is a Happybase code sample:

    from happybase import Connection
    c = Connection(...)
    t = c.table('my_table')
    count = 0
    for _ in t.scan(filter='FirstKeyOnlyFilter() AND KeyOnlyFilter()'):
        count += 1

    print count

Thanks to @Tuckr and @KennyCason for the tip.

like image 34
Matthew Moisen Avatar answered Sep 24 '22 09:09

Matthew Moisen


Use the HBase rowcount map/reduce job that's included with HBase

like image 40
Arnon Rotem-Gal-Oz Avatar answered Sep 26 '22 09:09

Arnon Rotem-Gal-Oz


Simple, Effective and Efficient way to count row in HBASE:

  1. Whenever you insert a row trigger this API which will increment that particular cell.

    Htable.incrementColumnValue(Bytes.toBytes("count"), Bytes.toBytes("details"), Bytes.toBytes("count"), 1);
    
  2. To check number of rows present in that table. Just use "Get" or "scan" API for that particular Row 'count'.

By using this Method you can get the row count in less than a millisecond.

like image 24
Balaji Avatar answered Sep 22 '22 09:09

Balaji


To count the Hbase table record count on a proper YARN cluster you have to set the map reduce job queue name as well:

hbase org.apache.hadoop.hbase.mapreduce.RowCounter -Dmapreduce.job.queuename= < Your Q Name which you have SUBMIT access>
 < TABLE_NAME>
like image 37
Dean Jain Avatar answered Sep 26 '22 09:09

Dean Jain


You can use coprocessor what is available since HBase 0.92. See Coprocessor and AggregateProtocol and example

like image 27
Alexander Kuznetsov Avatar answered Sep 23 '22 09:09

Alexander Kuznetsov


Two ways Worked for me to get count of rows from hbase table with Speed

Scenario #1

If hbase table size is small then login to hbase shell with valid user and execute

>count '<tablename>'

Example

>count 'employee'

6 row(s) in 0.1110 seconds

Scenario #2

If hbase table size is large,then execute inbuilt RowCounter map reduce job: Login to hadoop machine with valid user and execute:

/$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter '<tablename>'

Example:

 /$HBASE_HOME/bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'employee'

     ....
     ....
     ....
     Virtual memory (bytes) snapshot=22594633728
                Total committed heap usage (bytes)=5093457920
        org.apache.hadoop.hbase.mapreduce.RowCounter$RowCounterMapper$Counters
                ROWS=6
        File Input Format Counters
                Bytes Read=0
        File Output Format Counters
                Bytes Written=0
like image 31
Harsimranjit Singh Kler Avatar answered Sep 23 '22 09:09

Harsimranjit Singh Kler


If you're using a scanner, in your scanner try to have it return the least number of qualifiers as possible. In fact, the qualifier(s) that you do return should be the smallest (in byte-size) as you have available. This will speed up your scan tremendously.

Unfortuneately this will only scale so far (millions-billions?). To take it further, you can do this in real time but you will first need to run a mapreduce job to count all rows.

Store the Mapreduce output in a cell in HBase. Every time you add a row, increment the counter by 1. Every time you delete a row, decrement the counter.

When you need to access the number of rows in real time, you read that field in HBase.

There is no fast way to count the rows otherwise in a way that scales. You can only count so fast.

like image 37
Tucker Avatar answered Sep 22 '22 09:09

Tucker


U can find sample example here:

/**
     * Used to get the number of rows of the table
     * @param tableName
     * @param familyNames
     * @return the number of rows
     * @throws IOException
     */
    public long countRows(String tableName, String... familyNames) throws IOException {
        long rowCount = 0;
        Configuration configuration = connection.getConfiguration();
        // Increase RPC timeout, in case of a slow computation
        configuration.setLong("hbase.rpc.timeout", 600000);
        // Default is 1, set to a higher value for faster scanner.next(..)
        configuration.setLong("hbase.client.scanner.caching", 1000);

        AggregationClient aggregationClient = new AggregationClient(configuration);
        try {
            Scan scan = new Scan();
            if (familyNames != null && familyNames.length > 0) {
                for (String familyName : familyNames) {
                    scan.addFamily(Bytes.toBytes(familyName));
                }
            }
            rowCount = aggregationClient.rowCount(TableName.valueOf(tableName), new LongColumnInterpreter(), scan);
        } catch (Throwable e) {
            throw new IOException(e);
        }
        return rowCount;
    }
like image 35
Ranga Reddy Avatar answered Sep 26 '22 09:09

Ranga Reddy


Go to Hbase home directory and run this command,

./bin/hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'namespace:tablename'

This will launch a mapreduce job and the output will show the number of records existing in the hbase table.

like image 42
jack AKA karthik Avatar answered Sep 24 '22 09:09

jack AKA karthik