Pros/cons of streaming into BigQuery directly vs through Google Pub/Sub + Dataflow

We have a Node.js API hosted on Google Kubernetes Engine, and we'd like to start logging events into BigQuery.

I can see three different ways of doing that:

1. Insert each event directly into BigQuery using the Node BigQuery SDK in the API (as described under "Streaming Insert Examples" here: https://cloud.google.com/bigquery/streaming-data-into-bigquery or here: https://github.com/googleapis/nodejs-bigquery/blob/7d7ead644e1b9fe8428462958dbc9625fe6c99c8/samples/tables.js#L367).
2. Publish each event to a Cloud Pub/Sub topic, then write a Cloud Dataflow pipeline to stream it into BigQuery (in Java or Python only, it seems), like here: https://blog.doit-intl.com/replacing-mixpanel-with-bigquery-dataflow-and-kubernetes-b5f844710674 or here: https://github.com/bomboradata/pubsub-to-bigquery
3. Publish each event to a Pub/Sub topic from the API, but instead of Dataflow use a custom worker process that subscribes to the Pub/Sub topic on one side and streams into BigQuery on the other, like here: https://github.com/GoogleCloudPlatform/kubernetes-bigquery-python/blob/master/pubsub/pubsub-pipe-image/pubsub-to-bigquery.py or here: https://github.com/mchon89/Google_PubSub_BigQuery/blob/master/pubsub_to_bigquery.py
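
To make Option 1 concrete, here is a minimal sketch of a streaming insert with a stable `insertId`, which BigQuery uses for best-effort de-duplication when a request is retried. The `id` field on each event and the `analytics`/`events` names in the comment are hypothetical, and `table` is assumed to be a `Table` object from `@google-cloud/bigquery`.

```javascript
// Wrap each event in the raw row format accepted by table.insert(..., {raw: true}).
// Assumption: every event carries an `id` assigned once, when the event is created,
// so that a retried insert reuses the same insertId.
function toRows(events) {
  return events.map((event) => ({insertId: event.id, json: event}));
}

// `table` would come from the SDK, for example (hypothetical names):
//   const {BigQuery} = require('@google-cloud/bigquery');
//   const table = new BigQuery().dataset('analytics').table('events');
// Resolves to the list of rejected rows (empty when everything was accepted).
async function streamEvents(table, events) {
  try {
    await table.insert(toRows(events), {raw: true});
    return [];
  } catch (err) {
    if (err.name === 'PartialFailureError') {
      // The SDK reports per-row failures in err.errors.
      return err.errors;
    }
    throw err; // network or auth problems: let the caller decide how to retry
  }
}
```

Passing the table in as an argument keeps the function easy to exercise with a stub, and leaves the retry policy for non-row errors to the caller.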

For this particular use case, we don't need to do any transforms and will just send events that are already in the right format. But we may later have other use cases where we'll need to sync tables from our main datastore (MySQL) into BigQuery for analytics, so maybe starting with Dataflow straight away is worth it?

A few questions:

- Option 1 (sending each event straight to BigQuery) seems simplest if you don't have any transforms to do. Is it just as fast and reliable as publishing to a Pub/Sub topic? I'm mainly concerned about latency and error/duplication handling (https://cloud.google.com/bigquery/troubleshooting-errors#streaming). Maybe this is better done in a separate process?
- For Option 2, are there any Dataflow "presets" that don't require you to write custom code when all you need is to read from Pub/Sub and send reliably into BigQuery with no transforms (maybe just deduplication / error handling)?
- Are there any drawbacks to having a simple custom worker (Option 3) that reads from Pub/Sub, then streams into BigQuery and does all the error handling, retrying, etc.?
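
For reference, the custom worker in Option 3 could look roughly like the sketch below. It assumes messages carry one JSON-encoded event each, reuses the Pub/Sub message ID as the BigQuery `insertId`, and takes the `table` as a parameter; `makeHandler` is a hypothetical name. It deliberately leaves out batching and dead-lettering of messages that can never be parsed, which is part of what a hand-rolled worker would have to take on.

```javascript
// Build a handler suitable for subscription.on('message', handler) from
// @google-cloud/pubsub. `table` is assumed to be a BigQuery Table object.
function makeHandler(table) {
  return async function handleMessage(message) {
    try {
      // message.data is a Buffer holding the published payload.
      const event = JSON.parse(message.data.toString());
      // Pub/Sub may redeliver a message; reusing message.id as insertId lets
      // BigQuery drop the duplicate on a best-effort basis.
      await table.insert([{insertId: message.id, json: event}], {raw: true});
      message.ack();
    } catch (err) {
      // Not acknowledged: Pub/Sub will redeliver the message later.
      // A malformed message would loop forever without a dead-letter topic.
      message.nack();
    }
  };
}
```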

asked Jan 11 '18 18:01

renaudg

2 Answers

For Option 2, yes: there is a preset, called a Google-provided template, that moves data from Pub/Sub to BigQuery without you having to write any code.

You can learn more about how to use this Google-provided template, and others, in the Cloud Dataflow documentation.
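
With the template handling the Pub/Sub to BigQuery leg, the API side only has to publish each event as a JSON message whose fields match the destination table's columns. A minimal sketch of that publisher follows; the topic name in the comment is hypothetical, and `topic` is assumed to be a `Topic` object from `@google-cloud/pubsub`.

```javascript
// Serialize an event as the JSON payload of a Pub/Sub message.
function encodeEvent(event) {
  return Buffer.from(JSON.stringify(event));
}

// `topic` would come from the SDK, for example (hypothetical topic name):
//   const {PubSub} = require('@google-cloud/pubsub');
//   const topic = new PubSub().topic('api-events');
// Resolves to the ID Pub/Sub assigned to the message.
async function publishEvent(topic, event) {
  return topic.publishMessage({data: encodeEvent(event)});
}
```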

answered Sep 17 '22 15:09

Andrew Mo

Another option is to export the logs using a log sink. Right from the Stackdriver Logging UI, you can specify BigQuery (or other destinations) for your logs. Since your Node API is running in Kubernetes, you just need to log messages to stdout and they'll automatically get written to Stackdriver.

Reference: https://cloud.google.com/logging/docs/export/configure_export_v2
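
As a rough illustration of the stdout approach, the sketch below writes each event as a single JSON line, which the logging agent on GKE parses into a structured log entry that a sink can then export. The `event` field name and the `formatLogLine`/`logEvent` helpers are assumptions made for this example; `severity` and `message` are fields the agent treats specially.

```javascript
// Format one event as a single-line JSON log entry.
function formatLogLine(eventType, payload, severity = 'INFO') {
  return JSON.stringify({severity, message: eventType, event: payload});
}

// Write the entry to stdout, where the cluster's logging agent picks it up.
function logEvent(eventType, payload) {
  console.log(formatLogLine(eventType, payload));
}
```

A call such as `logEvent('signup', {userId: 42})` then emits one line per event, with the payload available under `event` in the exported rows.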

answered Sep 17 '22 15:09

Andrew Nguonly
