I've been running a test to compare the speed at which the Google BigQuery Python client library downloads query results with the speed of the Node.js client library. It seems that, out of the box, the Python library downloads data about twice as fast as the Node.js client. Why is that?
Below I provide the two tests, one in Python and one in JavaScript.
I've selected the usa_names public dataset of BigQuery as an example. The usa_1910_current table in this dataset has about 6 million rows and is about 180 MB in size. I have a 200 Mbps fibre download link (for information on the last mile). The data, after being packed into a pandas DataFrame, is about 1.1 GB (Pandas overhead included).
Python test
from google.cloud import bigquery
import time
import pandas as pd

bq_client = bigquery.Client("mydata-1470162410749")

sql = """SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"""
job_config = bigquery.QueryJobConfig()

start = time.time()
#---------------------------------------------------
query_job = bq_client.query(
    sql,
    location='US',
    job_config=job_config)
#---------------------------------------------------
end = time.time()
query_time = end - start

start = time.time()
#---------------------------------------------------
rows = list(query_job.result(timeout=30))
df = pd.DataFrame(data=[list(x.values()) for x in rows],
                  columns=list(rows[0].keys()))
#---------------------------------------------------
end = time.time()
iteration_time = end - start

dataframe_size_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
print("Size of the data in MB: " + str(dataframe_size_mb) + " MB")
print("Shape of the dataframe: " + str(df.shape))
print("Request time:", query_time)
print("Fetch time:", iteration_time)
Node.js test
// Import the Google Cloud client library
const {BigQuery} = require('@google-cloud/bigquery');
const moment = require('moment');

async function query() {
  const bigqueryClient = new BigQuery();
  const query = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`";
  const options = {
    query: query,
    location: 'US',
  };

  // Run the query as a job
  const [job] = await bigqueryClient.createQueryJob(options);
  console.log(`Job ${job.id} started.`);

  // Wait for the query to finish and fetch all result rows
  let startTime = moment.utc();
  console.log('Start: ', startTime.format("YYYY-MM-DD HH:mm:ss"));
  const [rows] = await job.getQueryResults();
  let endTime = moment.utc();
  console.log('End: ', endTime.format("YYYY-MM-DD HH:mm:ss"));
  console.log('Difference (s): ', endTime.diff(startTime) / 1000);
}

query();
For further reference, I also ran the tests against a 2 GB table...
As far as I can see, the Node.js client uses pagination to manage the result set, while the Python client looks like it fetches the entire result set and starts working with it. This may be what affects the performance of the Node.js client library. My recommendation is to take a look at the source code of both clients and to follow the Google Cloud Blog, where Google sometimes publishes tips and best practices for using their products; see, for example, this article: Testing Cloud Pub/Sub clients to maximize streaming performance.
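To probe the pagination hypothesis from the Python side, you could force the Python client to fetch the results page by page, roughly mimicking what the Node.js client appears to do. A minimal sketch, using the same query as above (the page_size of 100,000 is an arbitrary choice for illustration):

from google.cloud import bigquery
import time

bq_client = bigquery.Client("mydata-1470162410749")
sql = "SELECT * FROM `bigquery-public-data.usa_names.usa_1910_current`"

start = time.time()
# page_size forces the row iterator to fetch results in fixed-size pages,
# one API call per page, instead of letting the client decide.
row_iterator = bq_client.query(sql, location='US').result(page_size=100000)
total = 0
for page in row_iterator.pages:  # each page is fetched lazily
    total += len(list(page))
print("Rows:", total, "Fetch time:", time.time() - start)

If the page-by-page run is markedly slower than the single-shot fetch, that would support the pagination explanation for the Node.js client's performance.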