
Pros/cons of streaming into BigQuery directly vs through Google Pub/Sub + Dataflow


We have a Node.js API hosted on Google Kubernetes Engine, and we'd like to start logging events into BigQuery.

I can see three different ways of doing that:

  1. Insert each event directly into BigQuery from the API, using the Node.js BigQuery SDK (as described under "Streaming Insert Examples" here: https://cloud.google.com/bigquery/streaming-data-into-bigquery or here: https://github.com/googleapis/nodejs-bigquery/blob/7d7ead644e1b9fe8428462958dbc9625fe6c99c8/samples/tables.js#L367). See the sketch after this list.
  2. Publish each event to a Cloud Pub/Sub topic, then write a Cloud Dataflow pipeline that streams it into BigQuery (in Java or Python only, it seems), like here: https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-dataflow-and-kubernetes-b5f844710674 or here: https://github.com/bomboradata/pubsub-to-bigquery
  3. Publish each event to a Pub/Sub topic from the API, but instead of Dataflow use a custom worker process that subscribes to the Pub/Sub topic on one side and streams into BQ on the other, like here: https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python/blob/master/pubsub/pubsub-pipe-image/pubsub-to-bigquery.py or here: https://github.com/mchon89/Google_PubSub_BigQuery/blob/master/pubsub_to_bigquery.py
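
For reference, here's a minimal sketch of what options 1 and 2/3 would look like from the API side. The dataset, table and topic names (and the event.id field) are made up, and error handling is left out:

    // Option 1: stream the event straight into BigQuery from the API.
    // Raw rows with an insertId let BigQuery do best-effort deduplication.
    const {BigQuery} = require('@google-cloud/bigquery');
    const bigquery = new BigQuery();

    async function logEventDirect(event) {
      await bigquery
        .dataset('analytics')   // hypothetical dataset
        .table('events')        // hypothetical table
        .insert([{insertId: event.id, json: event}], {raw: true});
    }

    // Options 2/3: publish the event to a Pub/Sub topic instead, and let
    // Dataflow or a custom worker stream it into BigQuery.
    const {PubSub} = require('@google-cloud/pubsub');
    const pubsub = new PubSub();

    async function logEventViaPubSub(event) {
      await pubsub
        .topic('events')        // hypothetical topic
        .publish(Buffer.from(JSON.stringify(event)));
    }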

For this particular use case we don't need to do any transforms and will just send events in the right format already. But we may later have other use cases where we'll need to sync tables from our main datastore (MySQL) into BQ for analytics, so maybe starting with Dataflow straight away is worth it?

A few questions:

  • Option 1 (sending each event straight to BQ) seems simplest if you don't have any transforms to do. Is it just as fast and reliable as publishing to a Pub/Sub topic? I'm mainly concerned about latency and error/duplication handling (https://cloud.google.com/bigquery/troubleshooting-errors#streaming). Maybe this is better done in a separate process?
  • For Option 2, are there any Dataflow "presets" that don't require you to write custom code when all you need is to read from Pub/Sub and send the data reliably into BQ with no transforms (maybe just deduplication / error handling)?
  • Are there any drawbacks to having a simple custom worker (Option 3) that reads from Pub/Sub, streams into BQ and handles all errors / retries itself? A rough sketch of what I mean follows this list.
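
Roughly, I'm picturing something like this for the Option 3 worker (not a production implementation; the subscription name 'events-sub' and the dataset/table are the same made-up names as above). It leans on Pub/Sub redelivery (nack) for retries and on insertId for best-effort deduplication:

    // Option 3 sketch: a standalone worker bridging Pub/Sub -> BigQuery.
    const {PubSub} = require('@google-cloud/pubsub');
    const {BigQuery} = require('@google-cloud/bigquery');

    const table = new BigQuery().dataset('analytics').table('events');
    const subscription = new PubSub().subscription('events-sub');

    subscription.on('message', async (message) => {
      try {
        const event = JSON.parse(message.data.toString());
        // Reuse the Pub/Sub message id as insertId so redelivered messages
        // get deduplicated (best effort) on the BigQuery side.
        await table.insert([{insertId: message.id, json: event}], {raw: true});
        message.ack();
      } catch (err) {
        console.error('Insert failed, message will be redelivered', err);
        message.nack();
      }
    });

    subscription.on('error', (err) => console.error('Subscription error', err));
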
asked Jan 11 '18 by renaudg

People also ask

Can you stream data to BigQuery?

To stream data into BigQuery, you need the bigquery.tables.updateData IAM permission (it lets you insert data into the table).
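
For example, one way to grant that to the service account your API or worker runs as (the project and account names below are placeholders) is the BigQuery Data Editor role, which includes bigquery.tables.updateData:

    gcloud projects add-iam-policy-binding my-project \
      --member="serviceAccount:events-writer@my-project.iam.gserviceaccount.com" \
      --role="roles/bigquery.dataEditor"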

What are the benefits of the Dataflow streaming engine?

Benefits of Dataflow Shuffle: faster execution time of batch pipelines for the majority of pipeline job types; a reduction in consumed CPU, memory, and Persistent Disk storage resources on the worker VMs; and better autoscaling, since VMs no longer hold any shuffle data and can therefore be scaled down earlier.

Does BigQuery support streaming inserts?

The user sends a streaming insert into BigQuery via the tabledata.insertAll method. This insert is sent to the API in JSON format, along with other details such as authorization headers and details about the intended destination. A single insertAll call may have one or more individual records within it.

Can Pub/Sub write to BigQuery?

It provides a simplified pipeline development environment that uses the Apache Beam SDK to transform incoming data and then output the transformed data. If you want to write messages to BigQuery directly, without configuring Dataflow to provide data transformation, use a Pub/Sub BigQuery subscription.
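
Creating such a BigQuery subscription looks roughly like this (the subscription, topic, project, dataset and table names are placeholders):

    gcloud pubsub subscriptions create events-to-bq \
      --topic=events \
      --bigquery-table=my-project:analytics.events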


2 Answers

For Option 2: yes, there is a preset, called a Google-provided Template, that moves data from Pub/Sub to BigQuery without you having to write any code.

You can learn more about how to use this Google-provided Template, and others, in the Cloud Dataflow documentation.
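
As a rough illustration, the "Pub/Sub Topic to BigQuery" template can be launched with a single gcloud command along these lines (the job, topic, project, dataset and table names are placeholders; check the current template path and parameters in the docs):

    gcloud dataflow jobs run events-to-bq \
      --gcs-location=gs://dataflow-templates/latest/PubSub_to_BigQuery \
      --parameters=inputTopic=projects/my-project/topics/events,outputTableSpec=my-project:analytics.events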

answered by Andrew Mo


Another option is to export the logs using a log sink. Right from the Stackdriver Logging UI, you can specify BigQuery (or other destinations) for your logs. Since your Node API is running in Kubernetes, you just need to log messages to stdout and they'll automatically get written to Stackdriver.

Reference: https://cloud.google.com/logging/docs/export/configure_export_v2
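
For example, a sink that routes matching container logs into a BigQuery dataset can be created roughly like this (the sink name, project, dataset and log filter are placeholders; you'll also need to grant the sink's writer identity write access on the destination dataset):

    gcloud logging sinks create events-to-bq \
      bigquery.googleapis.com/projects/my-project/datasets/analytics \
      --log-filter='resource.type="k8s_container" AND jsonPayload.event="signup"'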

answered by Andrew Nguonly