I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.
The thing is, I now often find myself writing processing steps in SQL more than in PySpark since it is :
In the end, I only use Spark when I've got something to do that I can't express using SQL.
To be clear, my workflow is often like :
Am I missing something ? Is there any con in using BigQuery this way instead of Spark ?
Thanks
However, despite its unique advantages and powerful features, BigQuery is not a silver bullet. It is not recommended to use it on data that changes too often and, due to its storage location bound to Google's own services and processing limitations it's best not to use it as a primary data storage.
1. For both small and large datasets, user queries' performance on the BigQuery Native platform was significantly better than that on the Spark Dataproc cluster.
Stay organized with collections Save and categorize content based on your preferences. The spark-bigquery-connector is used with Apache Spark to read and write data from and to BigQuery.
A con I can see is the additional time required by the Hadoop cluster to create and finish the job. By making a direct request to BigQuery, this extra time can be decreased.
If your tasks need parallel processing, I would recommend using Spark, but if your app is mainly used to access to BQ, you might want to use the BQ Client Libraries and separate your current tasks:
BigQuery Client Libraries. They are optimized to connect to BQ. Here is a QuickStart and you can use different programming languages like python or java, among others.
Spark jobs. If you still need to perform transformations in Spark and need to read the data from BQ you can use the Dataproc-BQ connector. While this connector is installed in Dataproc by default, you can install it on-premises so that you can continue running you SparkML jobs with BQ data. Just in case it helps, you might want to consider using some GCP services like AutoML, BQ ML, AI Platform Notebooks, etc., they are specialized services for Machine Learning and AI.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With