BigQuery: Load from CSV, skip columns

Say I have a table with existing data, with a schema like:

{ 'name' : 'Field1', 'type' : 'STRING' },
{ 'name' : 'Field2', 'type' : 'STRING' }

Our data is CSV:

Field1,Field2
Value1,Value2
...

We load data by creating a new job that loads a CSV directly from Google Cloud Storage (GCS). Our data files now have an additional column and a different ordering, so the data is now structured as:

Field1,Field3,Field2
Value1,Value3,Value2
...

Is there a way to specify in the load job that we would like to skip the second column, and only load columns 1 and 3 (named Field1 and Field2)?

I am using the Python API, e.g. service.jobs().insert(job_body).

Basically I want to do something like this:

job_body = {
  'projectId': projectId,
  'configuration': {
    'load': {
      'sourceUris': [sourceCSV],
      'schema': {
        'fields': [
          {
            'name': 'Field1',
            'type': 'STRING'
          },
          {  # this would be the skipped field
            'name': None,
            'skip': True
          },
          {
            'name': 'Field2',
            'type': 'STRING'
          },
        ]
      },
      'destinationTable': {
        'projectId': projectId,
        'datasetId': datasetId,
        'tableId': targetTableId
      },
    }
  }
}

Thanks!

Kevin S. asked Sep 08 '14 23:09


2 Answers

Felipe's suggestion should work. Another possibility, if you're able to modify the CSV you're loading into BigQuery, would be the ignoreUnknownValues flag on load jobs:

[Optional] Accept rows that contain values that do not match the schema. The unknown values are ignored. Default is false which treats unknown values as errors. For CSV this ignores extra values at the end of a line. For JSON this ignores named values that do not match any column name.

Using this flag would, however, require reordering the columns in your CSV so that the unwanted column comes last, or formatting your data as JSON.
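As a rough sketch of this approach, the snippet below reorders a CSV so the wanted columns come first, then builds a load configuration with ignoreUnknownValues set so the trailing extra column is dropped. The helper names (move_columns_last, load_body_ignoring_extras) are made up for illustration, and the skipLeadingRows setting assumes the file starts with a header row; adjust or remove it if yours does not.

```python
import csv
import io


def move_columns_last(csv_text, keep):
    """Return CSV text with the columns named in `keep` first (in that
    order) and every other column moved to the end of each line."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    keep_idx = [header.index(name) for name in keep]
    rest_idx = [i for i in range(len(header)) if i not in keep_idx]
    order = keep_idx + rest_idx

    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([header[i] for i in order])
    for row in reader:
        writer.writerow([row[i] for i in order])
    return out.getvalue()


def load_body_ignoring_extras(project_id, dataset_id, table_id, source_uri):
    """Build a load job body whose schema lists only the wanted columns;
    ignoreUnknownValues drops extra values at the end of each CSV line."""
    return {
        'projectId': project_id,
        'configuration': {
            'load': {
                'sourceUris': [source_uri],
                'ignoreUnknownValues': True,
                'skipLeadingRows': 1,  # assumption: the file has a header row
                'schema': {
                    'fields': [
                        {'name': 'Field1', 'type': 'STRING'},
                        {'name': 'Field2', 'type': 'STRING'},
                    ]
                },
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': table_id,
                },
            }
        },
    }


sample = "Field1,Field3,Field2\nValue1,Value3,Value2\n"
print(move_columns_last(sample, ["Field1", "Field2"]))
```

The reordered file would still need to be uploaded back to GCS before the load job runs, so this only helps if you control how the CSV is produced.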

Danny Kitt answered Sep 18 '22 15:09


It's not currently possible to do that, but it could be an interesting feature request. Feel free to add it to https://code.google.com/p/google-bigquery/issues/list.

In the meantime, I would do a 2 step import:

  1. Import as a new table with 3 columns.
  2. Run "SELECT Field1, Field2 FROM [newtable]" and append the result to the existing table.
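
The two steps could be sketched as a pair of job bodies like the ones below. This is only an illustration: the helper names and the staging table are hypothetical, Field3 is assumed to be a STRING, and the query selects the two wanted columns by the names from the question. The bracketed table reference is legacy SQL syntax, so useLegacySql is set explicitly.

```python
def staging_load_body(project_id, dataset_id, staging_table_id, source_uri):
    """Step 1: load the three-column CSV as-is into a new staging table."""
    return {
        'projectId': project_id,
        'configuration': {
            'load': {
                'sourceUris': [source_uri],
                'schema': {
                    'fields': [
                        {'name': 'Field1', 'type': 'STRING'},
                        {'name': 'Field3', 'type': 'STRING'},  # assumed type
                        {'name': 'Field2', 'type': 'STRING'},
                    ]
                },
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': staging_table_id,
                },
            }
        },
    }


def append_query_body(project_id, dataset_id, staging_table_id, target_table_id):
    """Step 2: select only the wanted columns from the staging table and
    append the result to the existing target table."""
    query = 'SELECT Field1, Field2 FROM [{}.{}]'.format(
        dataset_id, staging_table_id)
    return {
        'projectId': project_id,
        'configuration': {
            'query': {
                'query': query,
                'useLegacySql': True,
                'destinationTable': {
                    'projectId': project_id,
                    'datasetId': dataset_id,
                    'tableId': target_table_id,
                },
                'writeDisposition': 'WRITE_APPEND',
            }
        },
    }


# Each body would be submitted with something like:
#   service.jobs().insert(projectId=project_id, body=body).execute()
# waiting for the load job to finish before starting the query job.
```

The staging table can be deleted once the append has succeeded.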

Felipe Hoffa answered Sep 19 '22 15:09
