
Spark with HBase vs Spark with HDFS

I know that HBase is a columnar database that stores the structured data of its tables in HDFS by column instead of by row. I know that Spark can read from and write to HDFS, and that there is an HBase connector for Spark that can now also read and write HBase tables.

Questions:

1) What capabilities do you gain by layering Spark on top of HBase instead of using HBase alone? Does it depend only on programmer capabilities, or is there a performance reason to do it? Are there things Spark can do that HBase alone cannot?

2) Following on from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?

Johan asked Aug 13 '16 08:08




1 Answer

1) What capabilities do you gain by layering Spark on top of HBase instead of using HBase alone? Does it depend only on programmer capabilities, or is there a performance reason to do it? Are there things Spark can do that HBase alone cannot?

At Splice Machine, we use Spark for our analytics on top of HBase. HBase has no execution engine of its own, and Spark provides a capable one on top of it (intermediate results, relational algebra, etc.). HBase is an MVCC storage structure and Spark is an execution engine, so they are natural complements to one another.
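To make the storage/engine split concrete, here is a toy sketch in plain Python. It does not use Spark or HBase at all: `KeyedStore` and `revenue_by_country` are hypothetical names invented for illustration. The store only offers keyed puts, gets, and ordered scans, while the join and aggregation have to live in a separate layer, which is the role Spark plays in the setup described above.

```python
from collections import defaultdict


class KeyedStore:
    """Hypothetical stand-in for a storage layer: puts, gets, and scans only."""

    def __init__(self):
        self._rows = {}

    def put(self, key, row):
        self._rows[key] = row

    def get(self, key):
        return self._rows.get(key)

    def scan(self):
        # Yield rows in key order, as a sorted keyed store would.
        for key in sorted(self._rows):
            yield key, self._rows[key]


def revenue_by_country(orders, customers):
    """Toy 'execution engine' step: join orders to customers, then aggregate."""
    totals = defaultdict(float)
    for _, order in orders.scan():
        # The join is done here, outside the store, via a point lookup.
        customer = customers.get(order["customer_id"])
        if customer is not None:
            totals[customer["country"]] += order["amount"]
    return dict(totals)


customers = KeyedStore()
customers.put("c1", {"country": "FR"})
customers.put("c2", {"country": "US"})

orders = KeyedStore()
orders.put("o1", {"customer_id": "c1", "amount": 10.0})
orders.put("o2", {"customer_id": "c2", "amount": 5.0})
orders.put("o3", {"customer_id": "c1", "amount": 2.5})

result = revenue_by_country(orders, customers)
print(result)
```

Nothing in the store knows what a join or a group-by is. A real engine adds distribution, intermediate results, and query planning on top of the same idea.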

2) Following on from the previous question, when should you add HBase between HDFS and Spark instead of using HDFS directly?

Small reads, concurrent read/write patterns, and incremental updates (which covers most ETL).
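As a rough illustration of the incremental-update case, the toy model below (plain Python, no HDFS or HBase involved; both function names are hypothetical) counts how many records each approach touches to change a single value. It assumes a file-style dataset that cannot be modified in place and so is rewritten in full, against a keyed table that writes only the affected row. It ignores everything a real system does underneath, such as write-ahead logging and compaction.

```python
def update_in_file(records, key, new_value):
    """File-style update: rewrite the whole dataset to change one record."""
    touched = 0
    rewritten = []
    for k, v in records:
        touched += 1
        rewritten.append((k, new_value if k == key else v))
    return rewritten, touched


def update_in_keyed_store(table, key, new_value):
    """Keyed-store update: write only the affected row."""
    table[key] = new_value
    return 1


# The same 1000 records, once as a flat list and once as a keyed table.
file_records = [(f"user{i}", 0) for i in range(1000)]
keyed_table = dict(file_records)

file_records, file_touched = update_in_file(file_records, "user42", 7)
store_touched = update_in_keyed_store(keyed_table, "user42", 7)

print(file_touched, store_touched)
```

The same asymmetry applies to small reads: fetching one record by key from the table is a single lookup, while the flat list has to be scanned.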

Good luck...

John Leach answered Oct 31 '22 16:10