Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions. It is not clear to me whether the connectors I have found support Spark 2.0 and the new Dataset API: <ul> <li> spark-hbase : https://github.com/apache/hbase/tree/master/hbase-spark </li> <li> spark-hbase-connector : https://github.com/nerdammer/spark-hbase-connector </li> <li> hortonworks-spark/shc : https://github.com/hortonworks-spark/shc </li> </ul> The project is written in Scala 2.11 and built with SBT. Thanks for your help.

Which HBase connector for Spark 2.0 should I use? [closed]

2 Answers

In addition to the other answer: using newAPIHadoopRDD means that you pull all the data from HBase, and from then on it is all core Spark. You would not get any HBase-specific APIs such as Filters. As for the current spark-hbase, only snapshot builds are available.

answered Sep 30 '22 05:09

Ramzy

Update: SHC now seems to work with Spark 2 and the Table API. See https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
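
For orientation, a minimal sketch of what reading a table through SHC into a DataFrame can look like. The table name `users`, the column family `cf`, and the column `name` are hypothetical, and the sketch assumes the SHC jar plus a suitable `hbase-site.xml` (for Bigtable, one that points at the Bigtable HBase client) are already on the classpath. The linked example project is the place to check the exact dependency and connection settings.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object ShcReadSketch {
  // Hypothetical catalog: maps the row key and one column of a table
  // named "users" to DataFrame columns. Adjust to your own schema.
  val catalog: String =
    """{
      |  "table": {"namespace": "default", "name": "users"},
      |  "rowkey": "key",
      |  "columns": {
      |    "id":   {"cf": "rowkey", "col": "key",  "type": "string"},
      |    "name": {"cf": "cf",     "col": "name", "type": "string"}
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shc-read-sketch").getOrCreate()

    // SHC is exposed as a Spark SQL data source; the catalog JSON is
    // passed in as an option.
    val df = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    df.show(10)
    spark.stop()
  }
}
```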

Original answer:

I don't believe any of these (or any other existing connector) will do all that you would like today.

<ul>
<li> spark-hbase will probably be the right solution when it is released (HBase 1.4?), but it currently only builds at head and is still working on Spark 2 support. </li>
<li> spark-hbase-connector only seems to support the RDD APIs, but since those are more stable, it might be somewhat helpful. </li>
<li> hortonworks-spark/shc probably won't work, because I believe it only supports Spark 1 and uses the older HTable APIs, which do not work with Bigtable. </li>
</ul>

I would recommend just using the HBase MapReduce APIs with RDD methods like newAPIHadoopRDD (or possibly the spark-hbase-connector?), then manually converting the RDDs into Datasets. This approach is a lot easier in Scala or Java than in Python.
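
A minimal sketch of that approach, assuming a hypothetical table `users` with a column family `cf` and a string column `name`, and assuming the connection settings come from an `hbase-site.xml` on the classpath. The case class `UserRow` is likewise made up for illustration.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Hypothetical row shape; defined at top level so Spark can derive an encoder.
case class UserRow(rowKey: String, name: Option[String])

object HBaseRddToDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-rdd-to-dataset").getOrCreate()
    import spark.implicits._

    // Picks up hbase-site.xml from the classpath (assumption: it is present).
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "users") // hypothetical table name

    // Each record is a (row key, Result) pair produced by TableInputFormat.
    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Convert to plain Scala values right away: Result is not
    // java.io.Serializable, so the raw pairs should not be shuffled or collected.
    val rows = hbaseRdd.map { case (_, result) =>
      val name = Option(result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("name")))
        .map(bytes => Bytes.toString(bytes))
      UserRow(Bytes.toString(result.getRow), name)
    }

    // The manual RDD-to-Dataset step.
    val ds = rows.toDS()
    ds.show(10)

    spark.stop()
  }
}
```

Mapping each `Result` into a case class before calling `toDS()` keeps HBase types out of the Dataset, so everything downstream is ordinary Spark SQL.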

This is an area that the HBase community is working to improve, and Google Cloud Dataproc will incorporate those improvements as they happen.

answered Sep 30 '22 07:09

Patrick Clay


Tags:

scala

apache-spark

hbase

google-cloud-bigtable

google-cloud-dataproc

ogen
