I have a large dataset stored in a BigQuery table and I would like to load it into a PySpark RDD for ETL processing.
I realized that BigQuery supports the Hadoop Input/Output format
https://cloud.google.com/hadoop/writing-with-bigquery-connector
and PySpark should be able to use that interface to create an RDD via the method newAPIHadoopRDD.
http://spark.apache.org/docs/latest/api/python/pyspark.html
Unfortunately, the documentation on both ends is scarce and goes beyond my knowledge of Hadoop/Spark/BigQuery. Has anybody figured out how to do this?
Install the spark-bigquery-connector into the Spark jars directory of every node by using the Dataproc connectors initialization action when you create your cluster, or provide the connector URI when you submit your job (in the Google Cloud console, use the "Jar files" field on the Dataproc "Submit a job" page).
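If you take the spark-bigquery-connector route, reading becomes a plain DataFrame load. A minimal sketch, assuming the connector jar is available on the cluster (initialization action) or passed via --jars at submit time; the public Shakespeare sample table is only used as an example:

from pyspark.sql import SparkSession

# Submit with the connector on the classpath, e.g. (jar URI is the commonly
# documented public one; adjust to your Scala/connector version):
#   gcloud dataproc jobs submit pyspark my_job.py \
#       --cluster=<cluster> --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar
spark = SparkSession.builder.appName("bq-read-example").getOrCreate()

# Read the BigQuery table straight into a DataFrame.
words = spark.read.format("bigquery").load("bigquery-public-data.samples.shakespeare")
words.select("word", "word_count").show(10)

If you can use the DataFrame API this is usually much simpler than wiring up newAPIHadoopRDD, and you can still call .rdd on the result if you need an RDD for your ETL code.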
Alternatively, install the BigQuery connector on your Dataproc cluster for direct programmatic read/write access to BigQuery. Note that a Cloud Storage bucket is used as an intermediary between the two services, but you interact with BigQuery directly from Dataproc.
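The write direction works with the same spark-bigquery-connector. A hedged sketch; the dataset, table, and staging bucket names below are placeholders of my own choosing, and the connector stages the rows in that Cloud Storage bucket before loading them into BigQuery:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-write-example").getOrCreate()
df = spark.createDataFrame([("hamlet", 1), ("macbeth", 2)], ["word", "word_count"])

# Indirect write: rows are staged in the given bucket, then loaded into the table.
(df.write.format("bigquery")
 .option("table", "<project_id>.<dataset>.word_counts")
 .option("temporaryGcsBucket", "<staging-bucket>")
 .mode("overwrite")
 .save())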
Loading from Cloud Storage into BigQuery supports multiple file formats: CSV, JSON, Avro, Parquet, and ORC.
BigQuery supports UTF-8 encoding for both nested/repeated and flat data. For CSV files only, BigQuery also supports ISO-8859-1 encoding for flat data.
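If you stage files in Cloud Storage yourself, the load into BigQuery can be driven from Python with the google-cloud-bigquery client. A sketch under assumptions: the source URI, destination table, and ISO-8859-1 encoding below are illustrative placeholders, not values from the question:

from google.cloud import bigquery

client = bigquery.Client()

# Load a CSV file from Cloud Storage into a BigQuery table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    encoding="ISO-8859-1",  # only valid for flat CSV data; UTF-8 is the default
)

load_job = client.load_table_from_uri(
    "gs://<bucket>/path/to/file.csv",
    "<project_id>.<dataset>.<table>",
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish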
Google now has an example of how to use the BigQuery connector with Spark.
There does seem to be a problem using the GsonBigQueryInputFormat, but I got a simple Shakespeare word-counting example working:
import json
import pyspark

sc = pyspark.SparkContext()
# The connector stages the BigQuery export in the cluster's system bucket on Cloud Storage.
bucket = sc._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
conf = {"mapred.bq.project.id": "<project_id>",
        "mapred.bq.gcs.bucket": bucket,
        "mapred.bq.input.project.id": "publicdata",
        "mapred.bq.input.dataset.id": "samples",
        "mapred.bq.input.table.id": "shakespeare"}
# Each record is (row number, JSON string); parse it and count the words.
tableData = (sc.newAPIHadoopRDD("com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat",
                                "org.apache.hadoop.io.LongWritable",
                                "com.google.gson.JsonObject", conf=conf)
             .map(lambda record: json.loads(record[1]))
             .map(lambda row: (row["word"], int(row["word_count"])))
             .reduceByKey(lambda x, y: x + y))
print(tableData.take(10))
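One caveat with the Hadoop input format: it exports the table to JSON files in Cloud Storage before Spark reads them, and those files are not removed for you. A hedged sketch of the cleanup, assuming you also set an explicit export path (e.g. "mapred.bq.temp.gcs.path") in the conf above; the directory name here is just an illustrative choice:

# Choose an export directory and add it to conf before building the RDD:
#   conf["mapred.bq.temp.gcs.path"] = input_directory
input_directory = "gs://{}/hadoop/tmp/bigquery/pyspark_input".format(bucket)

# After the job finishes, delete the exported JSON files via the Hadoop FileSystem API.
input_path = sc._jvm.org.apache.hadoop.fs.Path(input_directory)
input_path.getFileSystem(sc._jsc.hadoopConfiguration()).delete(input_path, True)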