
Need help creating schema for loading CSV into BigQuery

I am trying to load some CSV files into BigQuery from Google Cloud Storage and am wrestling with schema generation. There is an auto-generate option, but it is poorly documented. The problem is that if I let BigQuery generate the schema, it does a decent job of guessing data types, but it only sometimes recognizes the first row of the data as a header row; other times it treats the first row as data and generates column names like string_field_N. The first row of my data is always a header row. Some of the tables have many columns (over 30), and I do not want to hand-write the schema syntax, because BigQuery always fails with an uninformative error message when something (I have no idea what) is wrong with the schema.

So: How can I force it to recognize the first row as a header row? If that isn't possible, how do I get it to spit out the schema it generated in the proper syntax so that I can edit it (for appropriate column names) and use that as the schema on import?

Bill Rosenblatt asked Apr 12 '26 14:04

1 Answer

I would recommend doing two things here:

  1. Preprocess your file and store the final version of it without the first row, i.e. the header row.
  2. BQ load accepts an additional parameter, a JSON schema file; use this to explicitly define the table schema and pass the file as a parameter. This also gives you the flexibility to alter the schema at any point, if required.

Allowing BQ to autodetect schema is not advised.
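Since hand-writing the schema syntax is exactly what the asker wants to avoid, one way to implement step 2 is to generate the JSON schema file from the CSV header row itself. A minimal sketch, assuming a local copy of the file (the function name `schema_from_header` is mine; every column defaults to STRING, which you can then edit down to tighter types like INTEGER or TIMESTAMP before loading):

```python
import csv
import json

def schema_from_header(csv_path, out_path, default_type="STRING"):
    """Read the header row of a CSV and write a BigQuery-style JSON schema file.

    Column names have surrounding whitespace stripped and internal spaces
    replaced with underscores; all columns default to NULLABLE STRING.
    """
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))  # first row only
    schema = [
        {"name": col.strip().replace(" ", "_"),
         "type": default_type,
         "mode": "NULLABLE"}
        for col in header
    ]
    with open(out_path, "w") as f:
        json.dump(schema, f, indent=2)
    return schema
```

The resulting file is in the format that `bq load --schema=schema.json` expects; pairing it with `--skip_leading_rows=1` tells BigQuery to skip the header row instead of loading it as data, which also addresses the original question without any preprocessing.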

Raunak Jhawar answered Apr 15 '26 04:04