I am using Spark on a Google Cloud Dataproc cluster and I would like to access Bigtable in a PySpark job. Is there a Bigtable connector for Spark, similar to the Google BigQuery connector?
How can we access Bigtable from a PySpark application?
Cloud Bigtable is usually best accessed from Spark using the Apache HBase APIs.
HBase currently only provides Hadoop MapReduce I/O formats. These can be accessed from Spark (or PySpark) using the SparkContext.newAPIHadoopRDD methods. However, converting the records into something usable in Python is difficult; a sketch of this approach is shown below.
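For illustration, here is a rough, untested sketch of the newAPIHadoopRDD approach. It assumes the Bigtable HBase client jars and the Spark examples converter classes are on the driver and executor classpath; the project, instance, and table IDs are placeholders you would replace with your own.

    from pyspark import SparkContext

    sc = SparkContext()

    conf = {
        # Route the HBase API to Cloud Bigtable (placeholder project/instance IDs).
        "google.bigtable.project.id": "my-project",
        "google.bigtable.instance.id": "my-instance",
        "hbase.client.connection.impl":
            "com.google.cloud.bigtable.hbase1_x.BigtableConnection",
        # Table to scan via the HBase MapReduce input format.
        "hbase.mapreduce.inputtable": "my-table",
    }

    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        # String converters from the Spark examples project; without converters,
        # the HBase Result objects are not directly usable from Python.
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
        conf=conf)

    print(rdd.take(5))

The converter classes above only stringify the records, which is part of why this route is awkward; anything richer requires writing custom converters in Scala or Java.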
HBase is developing Spark SQL APIs, but these have not yet been integrated into a released version. Hortonworks has a Spark HBase Connector, but it compiles against Spark 1.6 (which requires Cloud Dataproc version 1.0); I have not used it, so I cannot speak to how easy it is to use.
Alternatively, you could use a Python-based Bigtable client and simply use PySpark for parallelism, as in the sketch below.
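Here is a minimal sketch of that idea, assuming the google-cloud-bigtable package is installed on the workers; the project, instance, and table IDs and the key prefixes are placeholders, and read_rows details can vary by client version. Each Spark task scans one row-key range with the plain Python client:

    from pyspark import SparkContext

    def scan_partition(prefixes):
        # Import inside the function so it runs on the executors.
        from google.cloud import bigtable

        client = bigtable.Client(project="my-project")  # placeholder project ID
        table = client.instance("my-instance").table("my-table")
        for prefix in prefixes:
            # Scan one key range per prefix; 0xff makes an exclusive upper bound.
            start = prefix.encode()
            rows = table.read_rows(start_key=start, end_key=start + b"\xff")
            for row in rows:
                yield row.row_key

    sc = SparkContext()
    # Example key ranges; one partition per prefix so scans run in parallel.
    prefixes = ["user0", "user1", "user2", "user3"]
    keys = sc.parallelize(prefixes, len(prefixes)) \
             .mapPartitions(scan_partition) \
             .collect()
    print(keys)

The trade-off is that you give up Hadoop input splits and locality; you are responsible for choosing key ranges that spread the work evenly across tasks.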