pandas to_gbq claims a schema mismatch while the schemas are exactly the same. On GitHub all the issues are claimed to have been solved in 2017

I am trying to append a table to a different table through pandas, pulling the data from BigQuery and sending it to a different BigQuery dataset. Although the table schema is exactly the same, I get the error: pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

This error occurred earlier when I used table overwrites, but in this case the datasets are too large for that (and overwriting is not a sustainable solution anyway).


import pandas as pd

# Pull the source data from BigQuery.
df = pd.read_gbq(query, project_id="my-project", credentials=bigquery_key,
                 dialect='standard')

# Append to the destination table, with the schema spelled out explicitly.
pd.io.gbq.to_gbq(df, dataset, projectid,
                 if_exists='append',
                 table_schema=[{'name': 'Date', 'type': 'STRING'},
                               {'name': 'profileId', 'type': 'STRING'},
                               {'name': 'Opco', 'type': 'STRING'},
                               {'name': 'country', 'type': 'STRING'},
                               {'name': 'deviceType', 'type': 'STRING'},
                               {'name': 'userType', 'type': 'STRING'},
                               {'name': 'users', 'type': 'INTEGER'},
                               {'name': 'sessions', 'type': 'INTEGER'},
                               {'name': 'bounceRate', 'type': 'FLOAT'},
                               {'name': 'sessionsPerUser', 'type': 'FLOAT'},
                               {'name': 'avgSessionDuration', 'type': 'FLOAT'},
                               {'name': 'pageviewsPerSession', 'type': 'FLOAT'}
                               ],
                 credentials=bigquery_key)

The schema in BigQuery is as follows:

Date                STRING      
profileId           STRING  
Opco                STRING  
country             STRING  
deviceType          STRING  
userType            STRING  
users               INTEGER 
sessions            INTEGER 
bounceRate          FLOAT   
sessionsPerUser     FLOAT   
avgSessionDuration  FLOAT   
pageviewsPerSession FLOAT   

I then get the following error:

Traceback (most recent call last):
  File "..file.py", line 63, in <module>
    main()
  File "..file.py", line 57, in main
    updating_general_data(bigquery_key)
  File "..file.py", line 46, in updating_general_data
    credentials=bigquery_key)
  File "..\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas\io\gbq.py", line 162, in to_gbq
    credentials=credentials, verbose=verbose, private_key=private_key)
  File "..\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pandas_gbq\gbq.py", line 1141, in to_gbq
    "Please verify that the structure and "
pandas_gbq.gbq.InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.

To me it seems there is a one-to-one match. Other threads about this error mainly discuss date formats, but here the date is already a string in this case and is, via table_schema, still declared as STRING.
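A declared schema can match the destination column for column and still trip this check, because what matters is the dtype pandas actually holds. As a hypothetical sketch (the DataFrame below is a stand-in for what read_gbq returns, not the asker's real data): BigQuery INTEGER columns often come back as nullable "Int64" and DATE columns as datetime64, so printing df.dtypes next to the declared types is a quick way to spot the mismatch.

```python
import pandas as pd

# Stand-in for the DataFrame read_gbq might return: note the dtypes.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2019-06-01']),   # datetime64, not a plain string
    'users': pd.array([10], dtype='Int64'),   # nullable integer extension dtype
    'bounceRate': [0.42],                     # float64
})

# Compare what pandas actually holds against the schema you declared.
declared = {'Date': 'STRING', 'users': 'INTEGER', 'bounceRate': 'FLOAT'}
for col, wanted in declared.items():
    print(f"{col}: dtype={df[col].dtype}, declared={wanted}")

# If the destination expects STRING, coerce the column back to strings.
df['Date'] = df['Date'].dt.strftime('%Y-%m-%d')
```

The strftime step is the usual fix when a date column was silently parsed to datetime64 on the way out of BigQuery.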

Asked by Joost Schimmel on Jun 11 '19

1 Answer

The ultimate solution is, instead of manually specifying the schema (which is always prone to type-casting and naming errors), to fetch the schema from the destination table itself. So, create a client using the latest version of the API:

from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file(
    'credentials.json')
project_id = 'your_project_id'  # no trailing comma, or this becomes a tuple
client = bigquery.Client(credentials=credentials, project=project_id)

Get the table you want to write/append to:

table = client.get_table('your_dataset.your_table')
table

Generate schema from the table:

generated_schema = [{'name':i.name, 'type':i.field_type} for i in table.schema]
generated_schema
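To see what that comprehension produces without a live BigQuery connection, here is a sketch using a namedtuple as a stand-in for google.cloud.bigquery.SchemaField (only the two attributes the comprehension reads, .name and .field_type, are modelled; the field names are illustrative):

```python
from collections import namedtuple

# Stand-in for google.cloud.bigquery.SchemaField.
SchemaField = namedtuple('SchemaField', ['name', 'field_type'])

# A table.schema is a sequence of such fields.
table_schema = [
    SchemaField('Date', 'STRING'),
    SchemaField('users', 'INTEGER'),
    SchemaField('bounceRate', 'FLOAT'),
]

# Same comprehension as above, applied to the stand-in schema.
generated_schema = [{'name': i.name, 'type': i.field_type} for i in table_schema]
print(generated_schema)
# [{'name': 'Date', 'type': 'STRING'}, {'name': 'users', 'type': 'INTEGER'},
#  {'name': 'bounceRate', 'type': 'FLOAT'}]
```

The result is exactly the list-of-dicts shape that to_gbq's table_schema parameter expects, so the destination table can never drift out of sync with what you pass.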

Rename your dataframe accordingly:

data.columns = [i.name for i in table.schema]
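This rename is positional, so it assumes the query selected the columns in the same order as the table. A minimal sketch with hypothetical column names showing what the assignment does:

```python
import pandas as pd

# Hypothetical DataFrame whose column casing drifted from the table's.
data = pd.DataFrame({'date': ['2019-06-01'], 'Users': [10]})

# Names as they appear in the BigQuery table (stand-ins here).
bq_names = ['Date', 'users']

# Positional rename: relies on the columns already being in table order.
data.columns = bq_names
print(list(data.columns))  # ['Date', 'users']
```

Since BigQuery column names are case-sensitive from pandas-gbq's point of view, even a casing difference like Users vs users can trigger the InvalidSchema error, which this rename removes.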

Pass the same schema while pushing it to BigQuery:

data.to_gbq(project_id='your_project_id',
            destination_table='your_dataset.your_table',
            credentials=service_account.Credentials.from_service_account_file(
                'credentials.json'),
            table_schema=generated_schema,
            progress_bar=True,
            if_exists='replace')  # use 'append' to keep the existing rows
Answered by Hamza on Oct 17 '22