 

Google BigQuery Python library is 2x as fast as the Node JS library for downloading results

I've been running a test comparing the speed at which the Google BigQuery Python client library downloads query results with that of the Node JS client library. It would seem that, out of the box, the Python library downloads data about twice as fast as the Node JS client. Why is that so?

Below I provide the two tests, one in Python and one in Javascript. I've selected the usa_names public dataset of BigQuery as an example. The usa_1910_current table in this dataset is about 6 million rows and about 180 MB in size. I have a 200 Mbit/s fibre downlink (for context on the last mile). The data, after being packed into a pandas dataframe, is about 1.1 GB (Pandas overhead included).
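As a quick sanity check on the numbers above (using the link speed and table size stated in the question), the network itself cannot be the bottleneck: a 200 Mbit/s link could move the ~180 MB table in a matter of seconds, so the multi-minute fetch times below are dominated by the clients, not the wire.

```python
# Back-of-envelope check using the figures from the question:
# how long would the raw transfer take on an ideal 200 Mbit/s link?

link_mbit_per_s = 200          # fibre downlink, megabits per second
table_mb = 180                 # approximate wire size of the table, megabytes

link_mb_per_s = link_mbit_per_s / 8          # convert to megabytes per second
ideal_seconds = table_mb / link_mb_per_s     # theoretical minimum transfer time

print(f"Ideal transfer time: {ideal_seconds:.1f} s")  # → 7.2 s
```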

Python test

from google.cloud import bigquery
import time
import pandas as pd

bq_client = bigquery.Client("mydata-1470162410749")

sql = """SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"""

job_config = bigquery.QueryJobConfig()

start = time.time()
#---------------------------------------------------
query_job = bq_client.query(
    sql,
    location='US',
    job_config=job_config)  
#--------------------------------------------------- 
end = time.time()
query_time = end-start

start = time.time()
#---------------------------------------------------
rows = list(query_job.result(timeout=30))
df = pd.DataFrame(data=[list(x.values()) for x in rows], columns=list(rows[0].keys()))
#---------------------------------------------------    
end = time.time()

iteration_time = end-start
dataframe_size_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print("Size of the data in Mb: " + str(dataframe_size_mb) + " Mb")
print("Shape of the dataframe: " + str(df.shape))
print("Request time:", query_time)
print("Fetch time:", iteration_time)

Node JS test

// Import the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const moment = require('moment')

async function query() {

  const bigqueryClient = new BigQuery();
  const query = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`";
  const options = {
    query: query,
    location: 'US',
  };

  // Run the query as a job
  const [job] = await bigqueryClient.createQueryJob(options);
  console.log(`Job ${job.id} started.`);

  // Wait for the query to finish
  let startTime = moment.utc()
  console.log('Start: ', startTime.format("YYYY-MM-DD HH:mm:ss"));
  const [rows] = await job.getQueryResults();
  let endTime = moment.utc()
  console.log('End: ', endTime.format("YYYY-MM-DD HH:mm:ss"));
  console.log('Difference (s): ', endTime.diff(startTime) / 1000)
}

query();

Python library test results with 180 MB of data:

  • Size of the data in Mb: 1172.0694370269775 Mb
  • Shape of the dataframe: (6028151, 5)
  • Request time: 3.58441424369812
  • Fetch time: 388.0966112613678 <-- This is 6.46 mins

Node JS library test results with 180 MB of data:

  • Start: 2019-06-03 19:11:03
  • End: 2019-06-03 19:24:12 <- About 13 mins

For further reference, I also ran the tests against a 2 GB table...

Python library test results with 2 GB of data:

  • Size of the data in Mb: 3397.0339670181274 Mb
  • Shape of the dataframe: (1278004, 21)
  • Request time: 2.4991791248321533
  • Fetch time: 867.7270500659943 <-- This is 14.45mins

Node JS library test results with 2 GB of data:

  • Start: 2019-06-03 15:30:59
  • End: 2019-06-03 16:02:49 <-- The difference is just below 31 mins
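For convenience, the rough speedup can be computed directly from the timings reported above, taking the Node durations from the wall-clock start/end stamps:

```python
# Speedup of the Python client over the Node JS client,
# derived from the timings reported in the two test runs above.

# Node durations from the logged wall-clock timestamps (seconds since midnight):
node_180mb = (19*3600 + 24*60 + 12) - (19*3600 + 11*60 + 3)   # 789 s
node_2gb   = (16*3600 + 2*60 + 49) - (15*3600 + 30*60 + 59)   # 1910 s

# Python fetch times as printed by the Python test:
speedup_180mb = node_180mb / 388.0966112613678   # ≈ 2.03x
speedup_2gb   = node_2gb / 867.7270500659943     # ≈ 2.20x

print(f"180 MB table: Node/Python = {speedup_180mb:.2f}x")
print(f"2 GB table:   Node/Python = {speedup_2gb:.2f}x")
```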
Asked Jun 03 '19 by Eben du Toit




1 Answer

As far as I can see, the Node JS client uses pagination to manage the result set, while the Python client appears to bring down the entire result set and then start working with it.

This may be what affects the performance of the Node JS client library. My recommendation is to take a look at the source code of both clients and to follow the Google Cloud Blog, where Google sometimes publishes tips and best practices for using their products; see, for example, the article Testing Cloud Pub/Sub clients to maximize streaming performance.
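The pagination point can be illustrated with a toy cost model (pure Python; the round-trip latency and transfer rate below are hypothetical numbers, not measured values): if every page costs a fixed round trip, fetching the ~6 million rows in many small pages adds noticeably more latency than fetching them in a few large pages, even though the raw transfer time is identical.

```python
# Toy model of a paged download (hypothetical latency/throughput figures):
# total time = raw transfer time + (number of pages x per-page round trip).

def paged_fetch_seconds(total_rows, page_size, rtt_s, rows_per_s):
    """Estimate wall-clock time to download total_rows in pages of page_size."""
    pages = -(-total_rows // page_size)          # ceiling division
    return total_rows / rows_per_s + pages * rtt_s

ROWS = 6_028_151          # row count from the question's 180 MB table
RTT = 0.05                # assumed 50 ms round trip per page request
RATE = 20_000             # assumed raw transfer rate in rows per second

small_pages = paged_fetch_seconds(ROWS, 10_000, RTT, RATE)   # 603 round trips
large_pages = paged_fetch_seconds(ROWS, 500_000, RTT, RATE)  # 13 round trips

print(f"10k-row pages:  {small_pages:.0f} s")
print(f"500k-row pages: {large_pages:.0f} s")
```

Under these assumed numbers the small-page strategy pays for 590 extra round trips; the real clients' page sizes and latencies differ, but the shape of the trade-off is the same.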

Answered Oct 05 '22 by Enrique Zetina