
Autodetect BigQuery schema within Dataflow?

Is it possible to use the equivalent of --autodetect in Dataflow?

That is, can we load data into a BigQuery table without specifying a schema, the same way we can load data from a CSV with --autodetect?

(potentially related question)

Maximilian asked Feb 04 '23 13:02


1 Answer

If you are using protocol buffers as the objects in your PCollections (which should perform very well on the Dataflow back end), you might be able to use a util I wrote a while ago. It converts the protobuf schema into a BigQuery schema at runtime by inspecting the protobuf descriptor.

I quickly uploaded it to GitHub. It's a work in progress, but you might be able to use it, or be inspired to write something similar using Java reflection (I might do that myself at some point).

You can use the util as follows:

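// Derive the BigQuery schema from the descriptor of the generated protobuf class.
TableSchema schema = ProtobufUtils.makeTableSchema(ProtobufClass.getDescriptor());
// Write the PCollection, creating the table if needed and replacing any existing rows.
enhanced_events.apply(BigQueryIO.Write.to(tableToWrite).withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));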

Here, the CREATE_IF_NEEDED create disposition creates the table with the specified schema if it does not already exist, and ProtobufClass is the class generated from your protobuf definition by the proto compiler.
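If your elements are plain Java objects rather than protobufs, the reflection-based variant mentioned above could look roughly like the sketch below. This is not the util from the answer: ReflectionSchemaSketch, Event, bigQueryType and describe are hypothetical names made up for illustration, the type mapping is deliberately minimal (unknown types fall back to STRING, nested records are not handled), and the result is a list of plain maps so the example runs without the BigQuery client library.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the util described above.
class ReflectionSchemaSketch {

    // Hypothetical element type standing in for the objects in a PCollection.
    static class Event {
        String userId;
        long timestampMillis;
        double score;
        Boolean active;
        List<String> tags;
    }

    // Maps a Java type to a BigQuery type name; anything unrecognized falls back to STRING.
    static String bigQueryType(Type type) {
        if (type == long.class || type == Long.class
                || type == int.class || type == Integer.class) {
            return "INTEGER";
        }
        if (type == double.class || type == Double.class
                || type == float.class || type == Float.class) {
            return "FLOAT";
        }
        if (type == boolean.class || type == Boolean.class) {
            return "BOOLEAN";
        }
        return "STRING";
    }

    // Builds one {name, type, mode} entry per instance field of the given class.
    static List<Map<String, String>> describe(Class<?> clazz) {
        List<Map<String, String>> columns = new ArrayList<>();
        for (Field field : clazz.getDeclaredFields()) {
            // Skip static and compiler-generated fields; they are not part of the data.
            if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                continue;
            }
            Type valueType = field.getType();
            // Primitives can never be null, so treat them as REQUIRED.
            String mode = field.getType().isPrimitive() ? "REQUIRED" : "NULLABLE";
            if (List.class.isAssignableFrom(field.getType())) {
                // A List<T> becomes a REPEATED column of T; a raw List falls back to STRING.
                mode = "REPEATED";
                Type generic = field.getGenericType();
                valueType = generic instanceof ParameterizedType
                        ? ((ParameterizedType) generic).getActualTypeArguments()[0]
                        : String.class;
            }
            Map<String, String> column = new LinkedHashMap<>();
            column.put("name", field.getName());
            column.put("type", bigQueryType(valueType));
            column.put("mode", mode);
            columns.add(column);
        }
        return columns;
    }
}
```

Calling ReflectionSchemaSketch.describe(ReflectionSchemaSketch.Event.class) yields one entry per field. In an actual pipeline you would presumably turn each entry into a TableFieldSchema (setName, setType, setMode), collect them with TableSchema.setFields, and pass the result to withSchema as in the snippet above.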

Matthias Baetens answered Feb 22 '23 23:02