 

HBase & Mahout - Using HBase as a Datastore/source for Mahout - Classification

I'm working on a large text classification project and we have our text data (simple messages) stored in HBase.

We have two problems. First, we would like to use HBase as the data source for Mahout classifiers, namely Bayes and Random Forests.

Second, we would like to store the generated model in HBase instead of using the in-memory approach (InMemoryBayesDatastore). As our data sets grow we are running into problems with memory utilization, and we would like to test HBase as a viable alternative.

There seems to be little material on using HBase with Mahout, or on whether it can serve as a data source at all. I'm using the Mahout 0.6 core API in Java, which has the in-memory datastore.

Doing a bit of digging, I believe there was an HBase Bayes datastore component: org.apache.mahout.classifier.bayes.datastore.HBaseBayesDatastore. See the older JavaDoc here: http://www.jarvana.com/jarvana/view/org/apache/mahout/mahout-core/0.3/mahout-core-0.3-javadoc.jar!/org/apache/mahout/classifier/bayes/datastore/HBaseBayesDatastore.html

However, looking at the latest documentation, this feature seems to have disappeared: https://builds.apache.org/job/Mahout-Quality/javadoc/

Is it still possible to use HBase as a data source for Bayes and Random Forests, and are there any previous use cases for this?

Thanks!

NightWolf asked Jul 25 '11 12:07



1 Answer

It's not directly possible, no. You could revive the old implementation, dust it off, and probably make it work without much trouble. It was indeed removed to slim down and focus the project.

You can of course also look at exporting your data, in some form, and adding it to a representation or store that is directly supported.
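As a rough illustration of the export route, here is a minimal sketch in plain Java. It assumes the messages have already been read out of HBase (for example with a table scan, not shown) into simple label/text pairs, and it writes one `label<TAB>tokens` line per message. That layout is what I recall the older file-based Bayes trainer examples using, so treat it as an assumption and check it against your Mahout version. `MahoutTextExport` and `LabeledMessage` are hypothetical names invented for this example; they are not part of Mahout or HBase.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper (not a Mahout or HBase class) for flattening messages to text lines. */
final class MahoutTextExport {

    /** One message as it might come out of an HBase scan: a class label plus the raw text. */
    static final class LabeledMessage {
        final String label;
        final String text;

        LabeledMessage(String label, String text) {
            this.label = label;
            this.text = text;
        }
    }

    private MahoutTextExport() {}

    /** Lowercases the text and collapses every run of non-letter, non-digit characters to one space. */
    static String normalize(String text) {
        return text.toLowerCase(Locale.ROOT)
                   .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                   .trim();
    }

    /** Builds a single "label<TAB>token token ..." line. Assumes the label contains no whitespace. */
    static String toTrainingLine(String label, String text) {
        return label + "\t" + normalize(text);
    }

    /** Converts all messages to lines, skipping any message with no tokens left after normalization. */
    static List<String> toTrainingLines(Iterable<LabeledMessage> messages) {
        List<String> lines = new ArrayList<String>();
        for (LabeledMessage m : messages) {
            if (normalize(m.text).isEmpty()) {
                continue; // nothing to train on
            }
            lines.add(toTrainingLine(m.label, m.text));
        }
        return lines;
    }

    public static void main(String[] args) {
        List<LabeledMessage> sample = Arrays.asList(
                new LabeledMessage("ham", "See you at 5"),
                new LabeledMessage("spam", "Buy now, cheap!"));
        for (String line : toTrainingLines(sample)) {
            System.out.println(line);
        }
    }
}
```

The resulting lines can be written to a file (for instance with `java.nio.file.Files.write`) and copied to HDFS, so training runs against a file-based input rather than against HBase directly.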

Generally speaking, you can use HBase with Mahout by virtue of the fact that Mahout uses Hadoop (mostly) and Hadoop can use HBase. That's not quite the situation here, though; the datastore was a more direct integration point, and it has since been removed.

Sean Owen answered Sep 26 '22 17:09