How to read and write data in Google Cloud Bigtable in PySpark application?

I am using Spark on a Google Cloud Dataproc cluster and would like to access Bigtable from a PySpark job. Is there a Bigtable connector for Spark, similar to the Google BigQuery connector?

How can we access Bigtable from a PySpark application?

asked Nov 02 '16 by Revan

1 Answer

Cloud Bigtable is usually best accessed from Spark using the Apache HBase APIs.

HBase currently provides only Hadoop MapReduce I/O formats. These can be accessed from Spark (or PySpark) using the SparkContext.newAPIHadoopRDD methods; however, converting the resulting records into something usable from Python is difficult.
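
For illustration, a minimal sketch of that pattern might look like the following. It assumes the HBase and bigtable-hbase client jars are on the classpath, that the key/value converter classes from the Spark examples project are available, and that my-table, my-project, and my-instance are placeholder names, not values from the question:

    # Hedged sketch: reading a Bigtable table through the HBase
    # TableInputFormat via SparkContext.newAPIHadoopRDD.
    from pyspark import SparkContext

    sc = SparkContext(appName="bigtable-hbase-read")

    conf = {
        "hbase.mapreduce.inputtable": "my-table",      # assumed table name
        # Bigtable is reached through the bigtable-hbase client; these
        # properties (or an hbase-site.xml) point it at the instance.
        "google.bigtable.project.id": "my-project",    # assumed project
        "google.bigtable.instance.id": "my-instance",  # assumed instance
    }

    rdd = sc.newAPIHadoopRDD(
        "org.apache.hadoop.hbase.mapreduce.TableInputFormat",
        "org.apache.hadoop.hbase.io.ImmutableBytesWritable",
        "org.apache.hadoop.hbase.client.Result",
        conf=conf,
        # Converters from the Spark examples jar turn HBase types into
        # strings; without them the Result objects are opaque in Python.
        keyConverter="org.apache.spark.examples.pythonconverters."
                     "ImmutableBytesWritableToStringConverter",
        valueConverter="org.apache.spark.examples.pythonconverters."
                       "HBaseResultToStringConverter",
    )

    print(rdd.take(1))

The awkwardness alluded to above lives in exactly those converters: the stringified form of an HBase Result loses type and cell structure, so you end up re-parsing records on the Python side.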

HBase is developing Spark SQL APIs, but these have not yet been integrated into a released version. Hortonworks has a Spark HBase Connector, but it compiles against Spark 1.6 (which requires Cloud Dataproc version 1.0), and I have not used it, so I cannot speak to how easy it is to use.

Alternatively, you could use a Python-based Bigtable client and simply use PySpark for parallelism.
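
A minimal sketch of that approach, assuming the google-cloud-bigtable package is installed on every worker; the project, instance, table, and row keys below are all placeholders:

    from pyspark import SparkContext

    def fetch_rows(keys):
        # Build the client inside each partition so nothing unpicklable
        # travels from the driver to the executors.
        from google.cloud import bigtable

        client = bigtable.Client(project="my-project")             # assumed
        table = client.instance("my-instance").table("my-table")   # assumed
        for key in keys:
            row = table.read_row(key.encode("utf-8"))
            if row is not None:
                # Flatten cells to plain bytes so results pickle cleanly.
                yield key, {col: [c.value for c in cells]
                            for col, cells in row.to_dict().items()}

    sc = SparkContext(appName="bigtable-python-client")
    row_keys = sc.parallelize(["row-1", "row-2", "row-3"])  # hypothetical keys
    print(row_keys.mapPartitions(fetch_rows).collect())

Here Spark only distributes the work; all Bigtable access goes through the ordinary Python client, which sidesteps the converter problem entirely at the cost of doing your own partitioning of the key space.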

answered Oct 13 '22 by Patrick Clay