Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas/Google BigQuery: Schema mismatch makes the upload fail

The schema in my google table looks like this:

price_datetime : DATETIME,
symbol         : STRING,
bid_open       : FLOAT,
bid_high       : FLOAT,
bid_low        : FLOAT,
bid_close      : FLOAT,
ask_open       : FLOAT,
ask_high       : FLOAT,
ask_low        : FLOAT,
ask_close      : FLOAT

After I do a pandas.read_gbq I get a dataframe with column dtypes like this:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

Now I want to use to_gbq so I convert my local dataframe (which I just made) from these dtypes:

price_datetime    datetime64[ns]
symbol                    object
bid_open                 float64
bid_high                 float64
bid_low                  float64
bid_close                float64
ask_open                 float64
ask_high                 float64
ask_low                  float64
ask_close                float64
dtype: object

to these dtypes:

price_datetime     object
symbol             object
bid_open          float64
bid_high          float64
bid_low           float64
bid_close         float64
ask_open          float64
ask_high          float64
ask_low           float64
ask_close         float64
dtype: object

by doing:

df['price_datetime'] = df['price_datetime'].astype(object)

Now I (think) I am read to use to_gbq so I do:

import pandas
pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')

but I get the error:

---------------------------------------------------------------------------
InvalidSchema                             Traceback (most recent call last)
<ipython-input-15-d5a3f86ad382> in <module>()
      1 a = time.time()
----> 2 pandas.io.gbq.to_gbq(df, <table_name>, <project_name>, if_exists='append')
      3 b = time.time()
      4 
      5 print(b-a)

C:\Users\me\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\io\gbq.py in to_gbq(dataframe, destination_table, project_id, chunksize, verbose, reauth, if_exists, private_key)
    825         elif if_exists == 'append':
    826             if not connector.verify_schema(dataset_id, table_id, table_schema):
--> 827                 raise InvalidSchema("Please verify that the structure and "
    828                                     "data types in the DataFrame match the "
    829                                     "schema of the destination table.")

InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
like image 437
user1367204 Avatar asked Dec 14 '22 22:12

user1367204


2 Answers

I had to do two things that solved the issue for me. First, I deleted my table and reuploaded it with the columns as TIMESTAMP types rather than DATETIME types. This made sure that the schema matched when the pandas.DataFrame with column type datetime64[ns] was uploaded to using to_gbq, which converts datetime64[ns] to TIMESTAMP type and not to DATETIME type (for now).

The second thing I did was upgrade from pandas 0.19 to pandas 0.20. These two things solved my problem of a schema mismatch.

like image 151
user1367204 Avatar answered Dec 28 '22 07:12

user1367204


This is probably an issue related to pandas. If you check the code for to_gbq, you'll see that it runs this code:

table_schema = _generate_bq_schema(dataframe)

Where _generate_bq_schema is given by:

def _generate_bq_schema(df, default_type='STRING'):
    """ Given a passed df, generate the associated Google BigQuery schema.
    Parameters
    ----------
    df : DataFrame
    default_type : string
        The default big query type in case the type of the column
        does not exist in the schema.
    """

    type_mapping = {
        'i': 'INTEGER',
        'b': 'BOOLEAN',
        'f': 'FLOAT',
        'O': 'STRING',
        'S': 'STRING',
        'U': 'STRING',
        'M': 'TIMESTAMP'
    }

    fields = []
    for column_name, dtype in df.dtypes.iteritems():
        fields.append({'name': column_name,
                       'type': type_mapping.get(dtype.kind, default_type)})

    return {'fields': fields}

As you can see, there's no type mapping to DATETIME. This inevitably gets mapped to type STRING (since its dtype.kind is "O") and then conflict occurs.

The only work around for now that I'm aware of would be to change your table schema from DATETIME to either TIMESTAMP or STRING.

It probably would be a good idea to start a new issue on pandas-bq repository asking to update this code to accept DATETIME as well.

[EDIT]:

I've opened this issue in their repository.

like image 37
Willian Fuks Avatar answered Dec 28 '22 05:12

Willian Fuks