Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Read from BigQuery into Spark in efficient way?

When using BigQuery Connector to read data from BigQuery I found that it copies all data first to Google Cloud Storage. Then reads this data in parallel into Spark, but when reading big table it takes very long time in copying data stage. So is there more efficient way to read data from BigQuery into Spark?

Another Question: reading from BigQuery composed of 2 stages (copying to GCS, reading in parallel from GCS). does copying stage affected by Spark cluster size or it take fixed time?

like image 872
Mahmoud Hanafy Avatar asked Jan 04 '17 10:01

Mahmoud Hanafy


People also ask

Is Spark faster than BigQuery?

For both small and large datasets, user queries' performance on the BigQuery Native platform was significantly better than that on the Spark Dataproc cluster.

Does BigQuery support Spark?

The spark-bigquery-connector is used with Apache Spark to read and write data from and to BigQuery. This tutorial provides example code that uses the spark-bigquery-connector within a Spark application.


2 Answers

Maybe a Googler will correct me, but AFAIK that's the only way. This is because under the hood it also uses the BigQuery Connector for Hadoop, which accordng to the docs:

The BigQuery connector for Hadoop downloads data into your Google Cloud Storage bucket before running a Hadoop job..

As a side note, this is also true when using Dataflow - it too performs an export of BigQuery table(s) to GCS first and then reads them in parallel.

WRT whether or not the copying stage (which is essentially a BigQuery export job) is influenced by your Spark cluster size, or if it's a fixed time - no. BigQuery export jobs are nondeterministic, and BigQuery uses its own resources for exporting to GCS i.e. not your Spark cluster.

like image 116
Graham Polley Avatar answered Nov 12 '22 03:11

Graham Polley


spark-bigquery-connector uses the BigQuery storage API which is super fast.

like image 27
SANN3 Avatar answered Nov 12 '22 03:11

SANN3