Say I have a table with existing data, with a schema like:
{ 'name' : 'Field1', 'type' : 'STRING' },
{ 'name' : 'Field2', 'type' : 'STRING' }
Our data is CSV:
Field1,Field2
Value1,Value2
...
We load data by creating a new job, loading a CSV directly from Google Cloud Storage (GCS). Our data files now have an additional column and different ordering, such that the data is now structured:
Field1,Field3,Field2
Value1,Value3,Value2
...
Is there a way to specify in the load job that we would like to skip the second column, and only load columns 1 and 3 (named Field1 and Field2)?
I am using the Python API, e.g. service.jobs().insert(projectId=projectId, body=job_body).
Basically I want to do something like this:
job_body = {
    'projectId': projectId,
    'configuration': {
        'load': {
            'sourceUris': [sourceCSV],
            'schema': {
                'fields': [
                    {
                        'name': 'Field1',
                        'type': 'STRING'
                    },
                    { # this would be the skipped field
                        'name': None,
                        'skip': True
                    },
                    {
                        'name': 'Field2',
                        'type': 'STRING'
                    },
                ]
            },
            'destinationTable': {
                'projectId': projectId,
                'datasetId': datasetId,
                'tableId': targetTableId
            },
        }
    }
}
Thanks!
Felipe's suggestion should work. Another possibility, if you're able to modify the CSV you're loading into BigQuery, would be the ignoreUnknownValues flag on load jobs:
[Optional] Accept rows that contain values that do not match the schema. The unknown values are ignored. Default is false which treats unknown values as errors. For CSV this ignores extra values at the end of a line. For JSON this ignores named values that do not match any column name.
Using this flag would, however, require reordering the columns in your CSV or formatting your data as JSON.
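As a rough sketch, assuming you can reorder the CSV so that Field1 and Field2 come first, the load-job body would look like the following. The project, dataset, table, and GCS URI values are placeholders:

```python
# Placeholder identifiers -- substitute your own values.
projectId = 'my-project'               # hypothetical
datasetId = 'my_dataset'               # hypothetical
targetTableId = 'my_table'             # hypothetical
sourceCSV = 'gs://my-bucket/data.csv'  # hypothetical

job_body = {
    'projectId': projectId,
    'configuration': {
        'load': {
            'sourceUris': [sourceCSV],
            # With a 2-column schema and ignoreUnknownValues set, BigQuery
            # drops any extra trailing values on each CSV line instead of
            # treating them as errors.
            'ignoreUnknownValues': True,
            'schema': {
                'fields': [
                    {'name': 'Field1', 'type': 'STRING'},
                    {'name': 'Field2', 'type': 'STRING'},
                ]
            },
            'destinationTable': {
                'projectId': projectId,
                'datasetId': datasetId,
                'tableId': targetTableId,
            },
        }
    }
}
```

You would then submit this as usual with service.jobs().insert(projectId=projectId, body=job_body).execute(). Note that for CSV the flag only ignores extra values at the end of a line, which is why the columns you keep must come first.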
It's not currently possible to do that, but it could be an interesting feature request. Feel free to add it to https://code.google.com/p/google-bigquery/issues/list.
In the meantime, I would do a 2 step import:
1. Load the CSV as-is into a new staging table, with a schema covering all three columns.
2. Run a query that selects only Field1 and Field2 from the staging table, appending the results to your existing table.
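A minimal sketch of the two job bodies this would involve, using the same Python API; the staging table name and other identifiers are placeholders:

```python
# Placeholder identifiers -- substitute your own values.
projectId = 'my-project'               # hypothetical
datasetId = 'my_dataset'               # hypothetical
stagingTableId = 'staging_table'       # hypothetical
targetTableId = 'my_table'             # hypothetical
sourceCSV = 'gs://my-bucket/data.csv'  # hypothetical

# Step 1: load the CSV as-is into a staging table, with all three columns.
load_job = {
    'projectId': projectId,
    'configuration': {
        'load': {
            'sourceUris': [sourceCSV],
            'schema': {
                'fields': [
                    {'name': 'Field1', 'type': 'STRING'},
                    {'name': 'Field3', 'type': 'STRING'},
                    {'name': 'Field2', 'type': 'STRING'},
                ]
            },
            'destinationTable': {
                'projectId': projectId,
                'datasetId': datasetId,
                'tableId': stagingTableId,
            },
        }
    }
}

# Step 2: select only the wanted columns from the staging table and
# append the result to the existing table.
query_job = {
    'projectId': projectId,
    'configuration': {
        'query': {
            'query': 'SELECT Field1, Field2 FROM [%s.%s]' % (datasetId, stagingTableId),
            'destinationTable': {
                'projectId': projectId,
                'datasetId': datasetId,
                'tableId': targetTableId,
            },
            'writeDisposition': 'WRITE_APPEND',
        }
    }
}

# Each body would be submitted with:
#   service.jobs().insert(projectId=projectId, body=...).execute()
```

Once the append succeeds, the staging table can be deleted.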