I currently have an app written in Apps Script to import some CSV files from Cloud Storage into BigQuery. While this is pretty simple, I am forced to specify the schema for the destination table.
What I am looking for is a way to read the CSV file and create the schema based on the column names in the first row. It is okay if all the variable types end up as strings. I feel like this is a pretty common scenario; does anyone have any guidance on this?
Many thanks, Nick
When you run the SELECT, you can specify the order you want for your columns; just write it as SELECT col1, col2, col3 FROM ...
To enable schema auto-detection when loading data, use one of these approaches: in the Google Cloud console, in the Schema section, for Auto detect, check the Schema and input parameters option; or in the bq command-line tool, run the bq load command with the --autodetect parameter.
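Since the question is about Apps Script specifically, the same option is also exposed as the autodetect field of a load job's configuration. Here is a minimal sketch using the BigQuery advanced service (which must be enabled for the script); the project, dataset, table, and bucket names are placeholders.

```
// Minimal sketch: load a CSV from GCS with schema auto-detection.
// Assumes the BigQuery advanced service is enabled for this script;
// all project/dataset/table/bucket names below are placeholders.
function loadCsvWithAutodetect() {
  var projectId = 'my-project';
  var job = {
    configuration: {
      load: {
        sourceUris: ['gs://my-bucket/my-file.csv'],
        destinationTable: {
          projectId: projectId,
          datasetId: 'my_dataset',
          tableId: 'my_table'
        },
        sourceFormat: 'CSV',
        skipLeadingRows: 1,  // use the first row for column names, not data
        autodetect: true     // let BigQuery infer the schema
      }
    }
  };
  BigQuery.Jobs.insert(job, projectId);
}
```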
One option (not a particularly pleasant one, but an option) would be to make a raw HTTP request from Apps Script to GCS to read the first row of the data, split it on commas, and generate a schema from that. GCS doesn't have Apps Script integration, so you need to build the requests by hand. Apps Script does have some utilities to help you do this (as well as OAuth), but my guess is that it is going to be a decent amount of work to get right.
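For what it's worth, here is a rough sketch of what that request could look like with UrlFetchApp. It assumes the script's manifest requests a Cloud Storage read scope so that ScriptApp.getOAuthToken() returns a usable token; the bucket and object names are placeholders, and the naive comma split would break on quoted headers.

```
// Rough sketch: fetch the first bytes of a CSV object from GCS and build a
// string-only schema from its header row. Assumes the script's OAuth token
// carries a Cloud Storage read scope; names are placeholders.
function inferSchemaFromGcsCsv(bucket, object) {
  var url = 'https://storage.googleapis.com/storage/v1/b/' +
      encodeURIComponent(bucket) + '/o/' + encodeURIComponent(object) +
      '?alt=media';
  var response = UrlFetchApp.fetch(url, {
    headers: {
      Authorization: 'Bearer ' + ScriptApp.getOAuthToken(),
      Range: 'bytes=0-4095'  // enough to cover a typical header row
    }
  });
  var headerLine = response.getContentText().split('\n')[0];
  // Naive split; a header with quoted commas would need real CSV parsing.
  var fields = headerLine.split(',').map(function(name) {
    return {name: name.trim().replace(/\W/g, '_'), type: 'STRING'};
  });
  return {fields: fields};  // usable as the schema of a load job
}
```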
There are also a couple of things you could try from the BigQuery side. You could import the data into a temporary table as a single field (set the field delimiter to something that doesn't appear in the data, like '\r'). You can then read the header row (i.e. the first row of the temporary table) via tabledata.list(). Finally, you can run a query that splits the single field into columns with a regular expression, setting allow_large_results and a destination table.
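As a sketch of what the splitting step might look like from Apps Script (again via the BigQuery advanced service, with placeholder names, and assuming the temporary table's single column is named line): read the header back with Tabledata.list, then generate one legacy SQL REGEXP_EXTRACT per column.

```
// Sketch: read the header row of the single-column temporary table and
// build a legacy SQL query that splits the field into named columns.
// Assumes the BigQuery advanced service is enabled and the temp table's
// lone column is called 'line'; table names are placeholders.
function buildSplitQuery(projectId, datasetId, tableId) {
  var page = BigQuery.Tabledata.list(projectId, datasetId, tableId,
      {maxResults: 1});
  var headers = page.rows[0].f[0].v.split(',');
  // Column i is the (i+1)th comma-separated token of the line.
  var exprs = headers.map(function(name, i) {
    var prefix = new Array(i + 1).join('[^,]*,');  // i repetitions
    return "REGEXP_EXTRACT(line, r'^" + prefix + "([^,]*)') AS " +
        name.trim().replace(/\W/g, '_');
  });
  // Run this with allowLargeResults and a destinationTable set on the job.
  return 'SELECT ' + exprs.join(', ') +
      ' FROM [' + projectId + ':' + datasetId + '.' + tableId + ']';
}
```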
One other option would be to use a dummy schema with more columns than you'll ever have, then use the allow_jagged_rows option to allow rows that are missing data at the end. You can then read the first row with tabledata.list() (similar to the previous option) and figure out how many columns are actually present. Then you could generate a query that rewrites the table with the correct column names, as in the sketch below. The advantage of this approach is that you don't need regular expressions or parsing; it lets BigQuery do all of the CSV parsing.
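A sketch of that load configuration, with the same caveats as above (the advanced service must be enabled, the names are placeholders, and the 100-column cap is arbitrary):

```
// Sketch: load into a dummy schema of string columns, allowing jagged rows
// so that short rows are padded with NULLs rather than rejected.
// The 100-column cap is arbitrary; all names are placeholders.
function loadWithDummySchema(projectId, sourceUri) {
  var fields = [];
  for (var i = 0; i < 100; i++) {  // more columns than any file will have
    fields.push({name: 'col' + i, type: 'STRING'});
  }
  var job = {
    configuration: {
      load: {
        sourceUris: [sourceUri],
        destinationTable: {
          projectId: projectId,
          datasetId: 'my_dataset',
          tableId: 'raw_table'
        },
        sourceFormat: 'CSV',
        schema: {fields: fields},
        allowJaggedRows: true  // rows may omit trailing columns
      }
    }
  };
  BigQuery.Jobs.insert(job, projectId);
}
```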
There is a downside to both of the latter two approaches, however: the BigQuery load mechanism doesn't guarantee to preserve the ordering of your data. In practice, the first row of the file should always end up as the first row of the table, but that isn't guaranteed to always be true.
Sorry there isn't a better solution. We've had a feature request on the table for a long time to auto-infer schemas; I'll take this as another vote for it.