How to read and write data in Google Cloud Bigtable in PySpark application?

I am using Spark on a Google Cloud Dataproc cluster and would like to access Bigtable from a PySpark job. Is there a Bigtable connector for Spark, similar to the Google BigQuery connector?

How can we access Bigtable from a PySpark application?

asked Nov 02 '16 by Revan

1 Answer

Cloud Bigtable is usually best accessed from Spark using the Apache HBase APIs.

HBase currently provides only Hadoop MapReduce I/O formats. These can be accessed from Spark (or PySpark) using the SparkContext.newAPIHadoopRDD methods; however, converting the resulting records into something usable from Python is difficult.
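
For illustration, a minimal sketch of that pattern might look like the following. It assumes the HBase and bigtable-hbase client jars are on the classpath, that the key/value converter classes from the Spark examples project are available, and that my-table, my-project, and my-instance are placeholder names, not values from the question:

    # Hedged sketch: reading a Bigtable table through the HBase
    # TableInputFormat via SparkContext.newAPIHadoopRDD.
    from pyspark import SparkContext

    sc = SparkContext(appName="bigtable-hbase-read")

    conf = {
        "hbase.mapreduce.inputtable": "my-table",      # assumed table name
        # Bigtable is reached through the bigtable-hbase client; these
        # properties (or an hbase-site.xml) point it at the instance.
        "google.bigtable.project.id": "my-project",    # assumed project
        "google.bigtable.instance.id": "my-instance",  # assumed instance
    }

    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=conf,
        # Converters from the Spark examples jar turn HBase types into
        # strings; without them the Result objects are opaque in Python.
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
    )

    print(rdd.take(1))

The awkwardness alluded to above lives in exactly those converters: the stringified form of an HBase Result loses type and cell structure, so you end up re-parsing records on the Python side.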

HBase is developing Spark SQL APIs, but these have not yet been integrated into a released version. Hortonworks has a Spark HBase Connector, but it compiles against Spark 1.6 (which requires Cloud Dataproc version 1.0), and I have not used it, so I cannot speak to how easy it is to use.

Alternatively, you could use a Python-based Bigtable client and simply use PySpark for parallelism.
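
A minimal sketch of that approach, assuming the google-cloud-bigtable package is installed on every worker; the project, instance, table, and row keys below are all placeholders:

    from pyspark import SparkContext

    def fetch_rows(keys):
        # Build the client inside each partition so nothing unpicklable
        # travels from the driver to the executors.
        from google.cloud import bigtable

        client = bigtable.Client(project="my-project")             # assumed
        table = client.instance("my-instance").table("my-table")   # assumed
        for key in keys:
            row = table.read_row(key.encode("utf-8"))
            if row is not None:
                # Flatten cells to plain bytes so results pickle cleanly.
                yield key, {col: [c.value for c in cells]
                            for col, cells in row.to_dict().items()}

    sc = SparkContext(appName="bigtable-python-client")
    row_keys = sc.parallelize(["row-1", "row-2", "row-3"])  # hypothetical keys
    print(row_keys.mapPartitions(fetch_rows).collect())

Here Spark only distributes the work; all Bigtable access goes through the ordinary Python client, which sidesteps the converter problem entirely at the cost of doing your own partitioning of the key space.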

answered Oct 13 '22 by Patrick Clay