I know that HBASE is a columnar database that stores structured data of tables into HDFS by column instead of by row. I know that Spark can read/write from HDFS and that there is some HBASE-connector for Spark that can now also read-write HBASE tables.
Questions:
1) What are the added capabilities brought by layering Spark on top of HBASE instead of using HBASE solely? It depends only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBASE solely can't do?
2) Stemming from previous question, when you should add HBASE between HDFS and SPARK instead of using directly HDFS?
Below is the difference between HDFS vs HBase are as follows: HDFS is a distributed file system that is well suited for the storage of large files. But HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. HDFS has based on GFS file system.
What are HDFS and HBase? HDFS is fault-tolerant by design and supports rapid data transfer between nodes even during system failures. HBase is a non-relational and open source Not-Only-SQL database that runs on top of Hadoop.
Data fragments in Hadoop can be too large and can create bottlenecks. Thus, it is slower than Spark. Spark is much faster as it uses MLib for computations and has in-memory processing. Hadoop has a slower performance as it uses disk for storage and depends upon disk read and write operations.
Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
1) What are the added capabilities brought by layering Spark on top of HBASE instead of using HBASE solely? It depends only on programmer capabilities or is there any performance reason to do that? Are there things Spark can do and HBASE solely can't do?
At Splice Machine, we use Spark for our analytics on top of HBase. HBase does not have an execution engine and spark provides a competent execution engine on top of HBase (Intermediate results, Relational Algebra, etc.). HBase is a MVCC storage structure and Spark is an execution engine. They are natural complements to one another.
2) Stemming from previous question, when you should add HBASE between HDFS and SPARK instead of using directly HDFS?
Small reads, concurrent write/read patterns, incremental updates (most etl)
Good luck...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With