
BigQuery replaced most of my Spark jobs, am I missing something?

I've been developing Spark jobs for some years using on-premise clusters and our team recently moved to the Google Cloud Platform allowing us to leverage the power of BigQuery and such.

The thing is, I now often find myself writing processing steps in SQL rather than in PySpark, since SQL is:

  • easier to reason about (less verbose)
  • easier to maintain (plain SQL versus Scala/Python code)
  • easy to run from the GUI if needed
  • fast without having to think much about partitioning, caching, and so on

In the end, I only use Spark when I've got something to do that I can't express using SQL.

To be clear, my workflow often looks like this:

  • preprocessing (previously in Spark, now in SQL)
  • feature engineering (previously in Spark, now mainly in SQL)
  • machine learning model and predictions (Spark ML)
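As an illustration of the first two steps, a typical preprocessing task is often a single window-function statement in BigQuery SQL, versus several lines of PySpark. This is only a sketch; the table and column names (`my_dataset.events`, `user_id`, `event_ts`) are hypothetical:

```python
# Hypothetical preprocessing step: keep only the latest event per user.
# In BigQuery it is one standard-SQL statement:
DEDUP_SQL = """
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_ts DESC) AS rn
  FROM my_dataset.events
)
WHERE rn = 1
"""

# ...versus the equivalent PySpark, which needs explicit imports,
# a window spec, and a temporary column:
#
# from pyspark.sql import Window
# from pyspark.sql.functions import row_number, col
#
# w = Window.partitionBy("user_id").orderBy(col("event_ts").desc())
# deduped = (events.withColumn("rn", row_number().over(w))
#            .filter(col("rn") == 1)
#            .drop("rn"))
```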

Am I missing something? Are there any downsides to using BigQuery this way instead of Spark?

Thanks

CARREAU Clément asked May 07 '19



1 Answer

One con I can see is the additional time the Hadoop cluster needs to spin up and finish a job. Making a direct request to BigQuery avoids that overhead.

If your tasks need parallel processing, I would recommend using Spark, but if your app mainly needs to access BQ, you might want to use the BQ Client Libraries and split your current tasks:

  • BigQuery Client Libraries. They are optimized for connecting to BQ. Here is a QuickStart, and they are available in several programming languages, including Python and Java.

  • Spark jobs. If you still need to perform transformations in Spark and read the data from BQ, you can use the Dataproc–BQ connector. This connector is installed on Dataproc by default, but you can also install it on-premises so that you can keep running your Spark ML jobs on BQ data. In case it helps, you might also want to look at GCP services specialized for machine learning and AI, such as AutoML, BQ ML, and AI Platform Notebooks.
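A minimal sketch of that split, assuming hypothetical project/dataset/table names: the feature-engineering SQL runs as a BigQuery job through the client library, and Spark only reads the resulting table through the connector for the ML step.

```python
# Hypothetical feature-engineering step, run entirely inside BigQuery.
# Dataset/table/column names are made up for illustration.
FEATURE_SQL = """
CREATE OR REPLACE TABLE my_dataset.user_features AS
SELECT
  user_id,
  COUNT(*)            AS n_events,
  AVG(session_length) AS avg_session_length
FROM my_dataset.events
GROUP BY user_id
"""

def run_feature_engineering(client):
    """Run the SQL step in BigQuery and wait for it to finish.

    `client` is a google.cloud.bigquery.Client (requires GCP credentials).
    """
    job = client.query(FEATURE_SQL)
    job.result()  # block until the table is (re)built
    return job

# On the Spark side (e.g. Dataproc, where the spark-bigquery-connector
# is preinstalled), the ML step just reads the prepared table:
#
# df = (spark.read.format("bigquery")
#       .option("table", "my_project.my_dataset.user_features")
#       .load())
# model = pipeline.fit(df)  # Spark ML pipeline, unchanged
```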

rsantiago answered Sep 25 '22