I currently have an app written in Apps Script to import some CSV files from Cloud Storage into BigQuery. While this is pretty simple, I am forced to specify the schema for the destination table.
What I am looking for is a way to read the CSV file and create the schema based on the column names in the first row. It is okay if all the variable types end up as strings. I feel like this is a pretty common scenario; does anyone have any guidance on this?
Many thanks, Nick
When you run the SELECT, you can specify the order you want for your columns; just write it as SELECT col1, col2, col3 FROM ...
To enable schema auto-detection when loading data, use one of these approaches: in the Google Cloud console, in the Schema section, for Auto detect, check the Schema and input parameters option; or in the bq command-line tool, run the bq load command with the --autodetect parameter.
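Since the question is about Apps Script specifically, the same option is also exposed as the autodetect field of a load job's configuration. Here is a minimal sketch using the BigQuery advanced service (which must be enabled for the script); the project, dataset, table, and bucket names are placeholders.

```
// Minimal sketch: load a CSV from GCS with schema auto-detection.
// Assumes the BigQuery advanced service is enabled for this script;
// all project/dataset/table/bucket names below are placeholders.
function loadCsvWithAutodetect() {
  var projectId = 'my-project';
  var job = {
    configuration: {
      load: {
        sourceUris: ['gs://my-bucket/my-file.csv'],
        destinationTable: {
          projectId: projectId,
          datasetId: 'my_dataset',
          tableId: 'my_table'
        },
        sourceFormat: 'CSV',
        skipLeadingRows: 1,  // use the first row for column names, not data
        autodetect: true     // let BigQuery infer the schema
      }
    }
  };
  BigQuery.Jobs.insert(job, projectId);
}
```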
One option (not a particularly pleasant one, but an option) would be to make a raw HTTP request from Apps Script to GCS to read the first row of the data, split it on commas, and generate a schema from that. GCS doesn't have Apps Script integration, so you need to build the requests by hand. Apps Script does have some utilities to help you do this (as well as OAuth), but my guess is that it is going to be a decent amount of work to get right.
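For what it's worth, here is a rough sketch of what that request could look like with UrlFetchApp. It assumes the script's manifest requests a Cloud Storage read scope so that ScriptApp.getOAuthToken() returns a usable token; the bucket and object names are placeholders, and the naive comma split would break on quoted headers.

```
// Rough sketch: fetch the first bytes of a CSV object from GCS and build a
// string-only schema from its header row. Assumes the script's OAuth token
// carries a Cloud Storage read scope; names are placeholders.
function inferSchemaFromGcsCsv(bucket, object) {
  var url = 'https://storage.googleapis.com/storage/v1/b/' +
      encodeURIComponent(bucket) + '/o/' + encodeURIComponent(object) +
      '?alt=media';
  var response = UrlFetchApp.fetch(url, {
    headers: {
      Authorization: 'Bearer ' + ScriptApp.getOAuthToken(),
      Range: 'bytes=0-4095'  // enough to cover a typical header row
    }
  });
  var headerLine = response.getContentText().split('\n')[0];
  // Naive split; a header with quoted commas would need real CSV parsing.
  var fields = headerLine.split(',').map(function(name) {
    return {name: name.trim().replace(/\W/g, '_'), type: 'STRING'};
  });
  return {fields: fields};  // usable as the schema of a load job
}
```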
There are also a couple of things you could try from the BigQuery side. You could import the data into a temporary table as a single field (set the field delimiter to something that doesn't appear in the data, like '\r'). You can then read the header row (i.e. the first row of the temporary table) via tabledata.list(). Finally, you can run a query that splits the single field into columns with a regular expression, setting allow_large_results and a destination table.
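As a sketch of what the splitting step might look like from Apps Script (again via the BigQuery advanced service, with placeholder names, and assuming the temporary table's single column is named line): read the header back with Tabledata.list, then generate one legacy SQL REGEXP_EXTRACT per column.

```
// Sketch: read the header row of the single-column temporary table and
// build a legacy SQL query that splits the field into named columns.
// Assumes the BigQuery advanced service is enabled and the temp table's
// lone column is called 'line'; table names are placeholders.
function buildSplitQuery(projectId, datasetId, tableId) {
  var page = BigQuery.Tabledata.list(projectId, datasetId, tableId,
      {maxResults: 1});
  var headers = page.rows[0].f[0].v.split(',');
  // Column i is the (i+1)th comma-separated token of the line.
  var exprs = headers.map(function(name, i) {
    var prefix = new Array(i + 1).join('[^,]*,');  // i repetitions
    return "REGEXP_EXTRACT(line, r'^" + prefix + "([^,]*)') AS " +
        name.trim().replace(/\W/g, '_');
  });
  // Run this with allowLargeResults and a destinationTable set on the job.
  return 'SELECT ' + exprs.join(', ') +
      ' FROM [' + projectId + ':' + datasetId + '.' + tableId + ']';
}
```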
One other option would be to use a dummy schema with more columns than you'll ever have, then use the allow_jagged_rows option to allow rows that are missing data at the end. You can then read the first row with tabledata.list() (similar to the previous option) and figure out how many columns are actually present. Then you could generate a query that rewrites the table with the correct column names, as in the sketch below. The advantage of this approach is that you don't need regular expressions or parsing; it lets BigQuery do all of the CSV parsing.
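A sketch of that load configuration, with the same caveats as above (the advanced service must be enabled, the names are placeholders, and the 100-column cap is arbitrary):

```
// Sketch: load into a dummy schema of string columns, allowing jagged rows
// so that short rows are padded with NULLs rather than rejected.
// The 100-column cap is arbitrary; all names are placeholders.
function loadWithDummySchema(projectId, sourceUri) {
  var fields = [];
  for (var i = 0; i < 100; i++) {  // more columns than any file will have
    fields.push({name: 'col' + i, type: 'STRING'});
  }
  var job = {
    configuration: {
      load: {
        sourceUris: [sourceUri],
        destinationTable: {
          projectId: projectId,
          datasetId: 'my_dataset',
          tableId: 'raw_table'
        },
        sourceFormat: 'CSV',
        schema: {fields: fields},
        allowJaggedRows: true  // rows may omit trailing columns
      }
    }
  };
  BigQuery.Jobs.insert(job, projectId);
}
```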
There is a downside to both of the latter two approaches, however: the BigQuery load mechanism doesn't guarantee to preserve the ordering of your data. In practice, the first row of the file should always end up as the first row of the table, but that isn't guaranteed to always be true.
Sorry there isn't a better solution. We've had a feature request on the table for a long time to auto-infer schemas; I'll take this as another vote for it.