 

How to import CSV to BigQuery using columns names from first row?

I currently have an app written in Apps Script to import some CSV files from Cloud Storage into BigQuery. While this is pretty simple, I am forced to specify the schema for the destination table.

What I am looking for is a way to read the CSV file and create the schema based on the column names in the first row. It is okay if all the variable types end up as strings. I feel like this is a pretty common scenario; does anyone have any guidance on this?

Much thanks, Nick

asked Feb 15 '14 by ntsue

People also ask

How do I change the order of columns in BigQuery?

When you write the SELECT, you can specify the order you want for your columns; just write it as SELECT col1, col2, col3 FROM ...

How do I auto detect schema in BigQuery?

To enable schema auto-detection when loading data, use one of these approaches: In the Google Cloud console, in the Schema section, for Auto detect, check the Schema and input parameters option. In the bq command-line tool, use the bq load command with the --autodetect parameter.
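
Since the question is about Apps Script, the equivalent there is the autodetect flag in a load job's configuration. A minimal sketch via the BigQuery advanced service might look like this (the project, dataset, table, and bucket names are placeholders, and the advanced service has to be enabled for the script):

    function loadCsvWithAutodetect() {
      var projectId = 'my-project';                     // placeholder project
      var job = {
        configuration: {
          load: {
            destinationTable: {
              projectId: projectId,
              datasetId: 'my_dataset',                  // placeholder dataset
              tableId: 'my_table'                       // placeholder table
            },
            sourceUris: ['gs://my-bucket/my-file.csv'], // placeholder CSV in GCS
            sourceFormat: 'CSV',
            skipLeadingRows: 1,                         // skip the header row
            autodetect: true                            // let BigQuery infer the schema
          }
        }
      };
      BigQuery.Jobs.insert(job, projectId);
    }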


1 Answer

One option (not a particularly pleasant one, but an option) would be to make a raw HTTP request from Apps Script to GCS to read the first row of the data, split it on commas, and generate a schema from that. GCS doesn't have Apps Script integration, so you need to build the requests by hand. Apps Script does have some utilities to let you do this (as well as OAuth), but my guess is that it is going to be a decent amount of work to get right.
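
For illustration, a rough sketch of that manual approach with UrlFetchApp and the GCS JSON API might look like the following (the bucket and object names are placeholders, the header row is assumed to fit in the first kilobyte, and the script's OAuth token is assumed to carry a Cloud Storage read scope):

    function buildSchemaFromHeader() {
      var url = 'https://storage.googleapis.com/storage/v1/b/my-bucket/o/' +
                encodeURIComponent('my-file.csv') + '?alt=media';
      var response = UrlFetchApp.fetch(url, {
        headers: {
          Authorization: 'Bearer ' + ScriptApp.getOAuthToken(),
          Range: 'bytes=0-1023'                         // only fetch the start of the file
        }
      });
      var headerLine = response.getContentText().split('\n')[0];
      var fields = headerLine.split(',').map(function (name) {
        return { name: name.trim(), type: 'STRING' };   // everything as STRING
      });
      return { fields: fields };                        // usable as the load job's schema
    }

The returned object could then be dropped into the load job's schema field in place of a hand-written schema.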

There are also a couple of things you could try from the BigQuery side. You could import the data into a temporary table as a single field (set the field delimiter to something that doesn't appear in the data, like '\r'). You can read the header row (i.e. the first row of the temporary table) via tabledata.list(). You can then run a query that splits the single field into columns with a regular expression, setting allow_large_results and a destination table.
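
A rough, legacy-SQL-era sketch of that flow, with the header read via tabledata.list() and the SELECT list generated from it, could look like this (the dataset and table names are placeholders, the temp table is assumed to have a single STRING column called line, and the header values are assumed to be valid column names):

    function splitTempTable(projectId) {
      // Read the first row of the temp table; it should be the header line.
      var rows = BigQuery.Tabledata.list(projectId, 'my_dataset', 'temp_table',
                                         { maxResults: 1 }).rows;
      var header = rows[0].f[0].v.split(',');

      // One REGEXP_EXTRACT per column; every output column stays a STRING.
      var exprs = header.map(function (name, i) {
        var pattern = '^' + new Array(i + 1).join('[^,]*,') + '([^,]*)';
        return "REGEXP_EXTRACT(line, '" + pattern + "') AS " + name.trim();
      });

      BigQuery.Jobs.insert({
        configuration: {
          query: {
            query: 'SELECT ' + exprs.join(', ') + ' FROM my_dataset.temp_table',
            allowLargeResults: true,                    // needed to write large results
            destinationTable: {
              projectId: projectId,
              datasetId: 'my_dataset',
              tableId: 'final_table'
            }
          }
        }
      }, projectId);
    }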

One other option would be to use a dummy schema with more columns than you'll ever have, then use the allow_jagged_rows option to allow rows that are missing data at the end of the row. You can then read the first row (similar to the previous option) with tabledata.list() and figure out how many columns are actually present. Then you could generate a query that rewrites the table with the correct column names. The advantage of this approach is that you don't need regular expressions or parsing; it lets BigQuery do all of the CSV parsing.
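
For example, the load configuration for that approach could look roughly like this (the column count of 50, the c1..c50 names, and the staging table are all placeholders):

    function loadWithDummySchema(projectId) {
      // A dummy schema of 50 STRING columns, c1..c50 (more than you'd ever need).
      var fields = [];
      for (var i = 1; i <= 50; i++) {
        fields.push({ name: 'c' + i, type: 'STRING' });
      }
      BigQuery.Jobs.insert({
        configuration: {
          load: {
            destinationTable: {
              projectId: projectId,
              datasetId: 'my_dataset',                  // placeholder
              tableId: 'staging_table'                  // placeholder
            },
            sourceUris: ['gs://my-bucket/my-file.csv'], // placeholder
            sourceFormat: 'CSV',
            schema: { fields: fields },
            allowJaggedRows: true                       // rows may have fewer than 50 columns
          }
        }
      }, projectId);
    }

From there, reading the first row with tabledata.list() tells you which of c1..c50 actually hold header names, and a SELECT that aliases each used column to its header name into the final table finishes the job.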

There is a downside to both of the latter two approaches, however: the BigQuery load mechanism does not guarantee to preserve the ordering of your data. In practice, the first row should always be the first row in the table, but that isn't guaranteed to always be true.

Sorry there isn't a better solution. We've had a feature request on the table for a long time to auto-infer schemas; I'll take this as another vote for it.

answered Sep 16 '22 by Jordan Tigani