Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQueryIO.read().fromQuery performance slow

One of the things I've noticed is that the performance of BigQueryIO.read().fromQuery() is quite slower than the performance of BigQueryIO.read().from() in Apache Beam. Why does this happen? And is there any way to improve it?

like image 895
rish0097 Avatar asked Apr 18 '18 11:04

rish0097


1 Answers

As discussed in this other Stack Overflow similar question (where they actually discussed about pricing and not speed), the reason why you perceive notable differences between those two BigQueryIO Reading methods is because, under the hood, they do completely different operations:

  • BigQueryIO.read.from() directly reads the whole table from BigQuery. This function exports the whole table to temporary files in Google Cloud Storage, where it will later be read from. This requires almost no computation, as it only performs an export job, and later Dataflow reads from GCS (not from BigQuery).
  • BigQueryIO.read.fromQuery() executes a query and then reads the results received after the query execution. Therefore, this function is more time-consuming, given that it requires that a query is first executed (which will incur in the corresponding economic and computational costs).

So, in short, the main difference between those two Apache Beam functions is that one reads the whole table from a BigQuery ⟷ GCS export, while the other executes a query and later reads its results. Their underlying concepts are completely different, so as an answer to your question about if there is any way to improve the performance, no, there is nothing you can do a part of designing the optimal query for your use case by following the BigQuery best practices.

like image 54
dsesto Avatar answered Oct 26 '22 18:10

dsesto