
What to use: Impala on HDFS, Impala on HBase, or just HBase?

I am working on a proof-of-concept task: implementing a feature of our product using Hadoop technology.

The feature is quite simple: we have a UI that lets you enter details about a "Network Issue". All details about such an issue are captured and inserted into a table in an Oracle DB. We then process the data in this table and calculate a Health Score.

I have to use Hadoop instead of a traditional DB, so my question is: what should I go for? Impala on HDFS, Impala on HBase, or HBase alone?

I am using a Cloudera VM for the POC implementation.

As I understand it, HBase is a distributed NoSQL database that sits as a layer on top of HDFS and provides Java APIs to access data. Impala is a tool that provides JDBC access to data stored in HBase or directly in HDFS. I am very new to Hadoop; can someone please help?

Ameya Y asked Jul 09 '13 06:07





1 Answer

Well, it depends on several things, such as the kind of processing you are going to perform and the response time you need. But from what you have written here, HBase alone seems fine. I don't see any need for Impala as of now. The HBase API is good and will serve most of your needs.

IMHO, it's better to keep things simple initially and add a tool only if it is really required. The same holds here: if you reach a point where the HBase API can no longer serve the purpose, you can always add Impala to your stack.

That said, there is one thing you should keep in mind. HBase is a NoSQL DB and doesn't follow RDBMS conventions and terminology, so you might find it a bit strange initially. You will have to design the schema in a way that is quite different from RDBMS-style schema design.
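To make the schema-design point a bit more concrete, here is a rough sketch of one common HBase habit: packing the access pattern into the row key instead of relying on secondary indexes. It uses no HBase classes at all. A `TreeMap` stands in for HBase's sorted row-key order, and the key layout (`deviceId#reversedTimestamp`), the device names, and the issue texts are all made up for illustration, not taken from your product.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IssueRowKeyDemo {

    // Hypothetical key layout: "<deviceId>#<Long.MAX_VALUE - timestamp>".
    // The reversed timestamp is zero-padded to 19 digits so that sorting the
    // keys as strings puts the newest issue of each device first.
    public static String rowKey(String deviceId, long epochMillis) {
        long reversed = Long.MAX_VALUE - epochMillis;
        return deviceId + "#" + String.format("%019d", reversed);
    }

    // Imitates a prefix scan: walk the sorted keys starting at the prefix
    // and stop at the first key that no longer starts with it.
    public static List<String> scanPrefix(TreeMap<String, String> table, String prefix) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : table.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            values.add(e.getValue());
        }
        return values;
    }

    public static void main(String[] args) {
        // The TreeMap is only a stand-in for a table sorted by row key.
        TreeMap<String, String> table = new TreeMap<>();
        table.put(rowKey("router-01", 1000L), "link down");
        table.put(rowKey("router-02", 1500L), "high latency");
        table.put(rowKey("router-01", 2000L), "packet loss");

        // All issues for router-01, newest first.
        System.out.println(scanPrefix(table, "router-01#"));
        // prints: [packet loss, link down]
    }
}
```

In an RDBMS you would probably add an index on device and time and write an `ORDER BY`; here the ordering falls out of the key itself, which is the kind of shift in thinking I mean. Whether this particular layout suits your Health Score calculation depends on how you need to read the data.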

Tariq answered Sep 19 '22 12:09
