Our stack is composed of Google Cloud Dataproc (Spark 2.0) and Google Cloud Bigtable (HBase 1.2.0), and I am looking for a connector that works with these versions.
For the connectors I have found, it is not clear to me whether they support Spark 2.0 and the new Dataset API:
The project is written in Scala 2.11 with SBT.
Thanks for your help
Spark HBase Connector(SHC) provides feature rich and efficient access to HBase through Spark SQL. It bridges the gap between the simple HBase key value store and complex relational SQL queries and enables users to perform complex data analytics on top of HBase using Spark.
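As a minimal sketch of what SHC usage looks like, the table is described to Spark through a JSON catalog that maps HBase column families and qualifiers to DataFrame columns. The table name, column family (`cf1`), and column names below are hypothetical; adjust the catalog to your own schema.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object ShcReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("shc-example").getOrCreate()

    // Hypothetical catalog: maps the row key and one column to DataFrame fields.
    val catalog =
      s"""{
         |"table":{"namespace":"default", "name":"mytable"},
         |"rowkey":"key",
         |"columns":{
         |  "id":{"cf":"rowkey", "col":"key", "type":"string"},
         |  "value":{"cf":"cf1", "col":"value", "type":"string"}
         |}
         |}""".stripMargin

    val df = spark.read
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .load()

    // Filters expressed in Spark SQL can be pushed down to HBase by SHC.
    df.filter(df("id") === "row1").show()
  }
}
```

Because SHC exposes HBase as a Spark SQL data source, predicate pushdown and column pruning happen at the connector level rather than after a full scan.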
In addition to the above answer: using newAPIHadoopRDD means that you read all the data out of HBase, and from then on it is all core Spark. You do not get access to any HBase-specific APIs such as Filters.
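A rough sketch of the newAPIHadoopRDD approach, using HBase's TableInputFormat, might look like the following (the table name is hypothetical; for Bigtable you would also point the configuration at the Bigtable HBase client):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.spark.sql.SparkSession

object HBaseRddExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("hbase-rdd").getOrCreate()

    // Standard HBase MapReduce input configuration; "mytable" is a placeholder.
    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable")

    // Full table scan; any filtering from here on happens in Spark, not HBase.
    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    println(s"Rows scanned: ${hbaseRdd.count()}")
  }
}
```

Note that this performs a full scan and ships every row to Spark; server-side Filters would have to be set on the Scan object serialized into the configuration instead.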
And for the current spark-hbase project, only snapshot builds are available (no stable release has been published).
Update: SHC now seems to work with Spark 2 and the Table API. See https://github.com/GoogleCloudPlatform/cloud-bigtable-examples/tree/master/scala/bigtable-shc
Original answer:
I don't believe any of these (or any other existing connector) will do everything you would like today.
I would recommend just using the HBase MapReduce APIs with RDD methods like newAPIHadoopRDD (or possibly the spark-hbase-connector?), and then manually converting the RDDs into Datasets. This approach is a lot easier in Scala or Java than in Python.
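The manual RDD-to-Dataset conversion step can be sketched as follows: map each HBase Result into a case class, then call toDS(). The table name, column family (`cf1`), qualifier, and case class are all hypothetical placeholders.

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.spark.sql.SparkSession

// Hypothetical schema for the rows we expect to read.
case class Record(key: String, value: String)

object RddToDatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-to-dataset").getOrCreate()
    import spark.implicits._

    val conf = HBaseConfiguration.create()
    conf.set(TableInputFormat.INPUT_TABLE, "mytable") // placeholder table name

    val hbaseRdd = spark.sparkContext.newAPIHadoopRDD(
      conf,
      classOf[TableInputFormat],
      classOf[ImmutableBytesWritable],
      classOf[Result])

    // Decode each (rowkey, Result) pair into a typed case class,
    // then lift the RDD into a Dataset via the implicit encoder.
    val ds = hbaseRdd.map { case (key, result) =>
      Record(
        Bytes.toString(key.get()),
        Bytes.toString(result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("value"))))
    }.toDS()

    ds.printSchema()
  }
}
```

The case class encoder is what makes this step much simpler in Scala or Java than in Python, where typed Datasets are not available.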
This is an area that the HBase community is working to improve and Google Cloud Dataproc will incorporate those improvements as they happen.