
BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.

The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since SQL is:

  • easier to reason about (less verbose)
  • easier to maintain (plain SQL versus Scala/Python code)
  • easy to run from the GUI if needed
  • fast without having to think much about partitioning, caching, and so on

In the end, I only use Spark when I've got something to do that I can't express using SQL.

To be clear, my workflow often looks like this:

  • preprocessing (previously in Spark, now in SQL)
  • feature engineering (previously in Spark, now mainly in SQL)
  • machine learning model and predictions (Spark ML)
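As an illustration of the first two steps, a typical preprocessing task is often a single window-function statement in BigQuery SQL, versus several lines of PySpark. This is only a sketch; the table and column names (`my_dataset.events`, `user_id`, `event_ts`) are hypothetical:

```python
# Hypothetical preprocessing step: keep only the latest event per user.
# In BigQuery it is one standard-SQL statement:
DEDUP_SQL = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
  FROM my_dataset.events
)
WHERE rn = 1
"""

# ...versus the equivalent PySpark, which needs explicit imports,
# a window spec, and a temporary column:
#
# from pyspark.sql import Window
# from pyspark.sql.functions import row_number, col
#
# w = Window.partitionBy("user_id").orderBy(col("event_ts").desc())
# deduped = (events.withColumn("rn", row_number().over(w))
#            .filter(col("rn") == 1)
#            .drop("rn"))
```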

Am I missing something? Are there any downsides to using BigQuery this way instead of Spark?

Thanks

CARREAU Clément asked May 07 '19



1 Answer

One con I can see is the additional time the Hadoop cluster needs to spin up and finish a job. Making a direct request to BigQuery avoids that overhead.

If your tasks need parallel processing, I would recommend using Spark, but if your app mainly needs to access BQ, you might want to use the BQ Client Libraries and split your current tasks:

  • BigQuery Client Libraries. They are optimized for connecting to BQ. Here is a QuickStart, and they are available in several programming languages, including Python and Java.

  • Spark jobs. If you still need to perform transformations in Spark and read the data from BQ, you can use the Dataproc–BQ connector. This connector is installed on Dataproc by default, but you can also install it on-premises so that you can keep running your Spark ML jobs on BQ data. In case it helps, you might also want to look at GCP services specialized for machine learning and AI, such as AutoML, BQ ML, and AI Platform Notebooks.
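A minimal sketch of that split, assuming hypothetical project/dataset/table names: the feature-engineering SQL runs as a BigQuery job through the client library, and Spark only reads the resulting table through the connector for the ML step.

```python
# Hypothetical feature-engineering step, run entirely inside BigQuery.
# Dataset/table/column names are made up for illustration.
FEATURE_SQL = """
CREATE OR REPLACE TABLE my_dataset.user_features AS
SELECT
  user_id,
  COUNT(*)            AS n_events,
  AVG(session_length) AS avg_session_length
FROM my_dataset.events
GROUP BY user_id
"""

def run_feature_engineering(client):
    """Run the SQL step in BigQuery and wait for it to finish.

    `client` is a google.cloud.bigquery.Client (requires GCP credentials).
    """
    job = client.query(FEATURE_SQL)
    job.result()  # block until the table is (re)built
    return job

# On the Spark side (e.g. Dataproc, where the spark-bigquery-connector
# is preinstalled), the ML step just reads the prepared table:
#
# df = (spark.read.format("bigquery")
#       .option("table", "my_project.my_dataset.user_features")
#       .load())
# model = pipeline.fit(df)  # Spark ML pipeline, unchanged
```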

rsantiago answered Sep 25 '22