I've been running a test to compare the speed at which the Google BigQuery Python client library downloads query results with the speed of the Node.js client library. It seems that, out of the box, the Python library downloads data about twice as fast as the Node.js client. Why is that?
Below I provide the two tests, one in Python and one in JavaScript.
I've selected the usa_names public dataset of BigQuery as an example. The usa_1910_current table in this dataset has about 6 million rows and is about 180 MB in size. I have a 200 Mbps fibre download link (for information on the last mile). The data, after being packed into a pandas DataFrame, is about 1.1 GB (Pandas overhead included).
Python test
from google.cloud import bigquery
import time
import pandas as pd

bq_client = bigquery.Client("mydata-1470162410749")

sql = """SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"""
job_config = bigquery.QueryJobConfig()

start = time.time()
#---------------------------------------------------
query_job = bq_client.query(
    sql,
    location='US',
    job_config=job_config)
#---------------------------------------------------
end = time.time()
query_time = end - start

start = time.time()
#---------------------------------------------------
rows = list(query_job.result(timeout=30))
df = pd.DataFrame(data=[list(x.values()) for x in rows],
                  columns=list(rows[0].keys()))
#---------------------------------------------------
end = time.time()
iteration_time = end - start

dataframe_size_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print("Size of the data in MB: " + str(dataframe_size_mb) + " MB")
print("Shape of the dataframe: " + str(df.shape))
print("Request time:", query_time)
print("Fetch time:", iteration_time)
Node.js test
// Import the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const moment = require('moment');

async function query() {
  const bigqueryClient = new BigQuery();
  const query = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`";
  const options = {
    query: query,
    location: 'US',
  };

  // Run the query as a job
  const [job] = await bigqueryClient.createQueryJob(options);
  console.log(`Job ${job.id} started.`);

  // Wait for the query to finish and fetch all result rows
  let startTime = moment.utc();
  console.log('Start: ', startTime.format("YYYY-MM-DD HH:mm:ss"));
  const [rows] = await job.getQueryResults();
  let endTime = moment.utc();
  console.log('End: ', endTime.format("YYYY-MM-DD HH:mm:ss"));
  console.log('Difference (s): ', endTime.diff(startTime) / 1000);
}

query();
For further reference, I also ran the tests against a 2 GB table...
As far as I can see, the Node.js client uses pagination to manage the result set, while the Python client looks like it fetches the entire result set and starts working with it. This may be what affects the performance of the Node.js client library. My recommendation is to take a look at the source code of both clients and to follow the Google Cloud Blog, where Google sometimes publishes tips and best practices for using their products; see, for example, this article: Testing Cloud Pub/Sub clients to maximize streaming performance.
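To probe the pagination hypothesis from the Python side, you could force the Python client to fetch the results page by page, roughly mimicking what the Node.js client appears to do. A minimal sketch, using the same query as above (the page_size of 100,000 is an arbitrary choice for illustration):

from google.cloud import bigquery
import time

bq_client = bigquery.Client("mydata-1470162410749")
sql = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"

start = time.time()
# page_size forces the row iterator to fetch results in fixed-size pages,
# one API call per page, instead of letting the client decide.
row_iterator = bq_client.query(sql, location='US').result(page_size=100000)
total = 0
for page in row_iterator.pages:  # each page is fetched lazily
    total += len(list(page))
print("Rows:", total, "Fetch time:", time.time() - start)

If the page-by-page run is markedly slower than the single-shot fetch, that would support the pagination explanation for the Node.js client's performance.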