
Create a BigQuery table from pandas dataframe, WITHOUT specifying schema explicitly

I have a pandas DataFrame and want to create a BigQuery table from it. I understand there are many posts asking this question, but all the answers I can find so far require explicitly specifying the schema of every column. For example:

from google.cloud import bigquery as bq

client = bq.Client()

dataset_ref = client.dataset('my_dataset', project='my_project')
table_ref = dataset_ref.table('my_table')

job_config = bq.LoadJobConfig(
    schema=[
        bq.SchemaField("a", bq.enums.SqlTypeNames.STRING),
        bq.SchemaField("b", bq.enums.SqlTypeNames.INT64),
        bq.SchemaField("c", bq.enums.SqlTypeNames.FLOAT64),
    ]
)

client.load_table_from_dataframe(my_df, table_ref, job_config=job_config).result()

However, sometimes I have a DataFrame with many columns (say, 100), and it's really non-trivial to specify all of them. Is there a way to do this efficiently?

By the way, I found this post with a similar question: Efficiently write a Pandas dataframe to Google BigQuery. But it seems that bq.Schema.from_dataframe does not exist:

AttributeError: module 'google.cloud.bigquery' has no attribute 'Schema'
asked Jul 31 '20 by user2830451


1 Answer

Here's a code snippet to load a DataFrame to BQ, letting the client infer the schema from the DataFrame's dtypes:

import pandas as pd
from google.cloud import bigquery

# Example data
df = pd.DataFrame({'a': [1,2,4], 'b': ['123', '456', '000']})

# Load client
client = bigquery.Client(project='your-project-id')

# Define table name, in format dataset.table_name
table = 'your-dataset.your-table'

# Load data to BQ and wait for the job to finish
job = client.load_table_from_dataframe(df, table)
job.result()

If you want to specify only a subset of the schema and still import all the columns, you can replace the last lines with

# Define a job config object, with a subset of the schema
job_config = bigquery.LoadJobConfig(schema=[bigquery.SchemaField('b', 'STRING')])

# Load data to BQ and wait for the job to finish
job = client.load_table_from_dataframe(df, table, job_config=job_config)
job.result()
answered Sep 30 '22 by Matteo Felici