Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

BigQuery - remove unused column from schema

I accidentally added a wrong column to my BigQuery table schema.

Instead of reloading the complete table (million of rows), I would like to know if the following is possible:

  • remove bad rows (rows with values contains the wrong column) by running a "select *" query on the table with some kind of filter, and saving result to same table.
  • removing the (now) unused column.

Is this functionality (or similar) supported? Possibly the "save result to table" functionality can have a "compact schema" option.

like image 464
Lior Avatar asked Feb 15 '16 10:02

Lior


People also ask

How do I edit a table schema in BigQuery?

In the Google Cloud console, go to the BigQuery page. In the Explorer panel, expand your project and dataset, then select the table. In the details panel, click the Schema tab. Click Edit schema.

How do I drop a column if exists?

In SQL, if you want to remove a column from a table, you need to use the ALTER TABLE statement with the DROP COLUMN clause. That removes the column and all its data.

How do I edit text schema in BigQuery?

In the Schema section, enter the schema definition. Option 1: Use Add field and specify each field's name, type, and mode. Option 2: Click Edit as text and paste the schema in the form of a JSON array.


1 Answers

If your table does not consist of record/repeated type fields - your simple option is:

  1. Select valid columns while filtering out bad records into new temp table

    SELECT < list of original columns >
    FROM YourTable
    WHERE < filter to remove bad entries here >

    Write above to temp table - YourTable_Temp

  2. Make a backup copy of "broken" table - YourTable_Backup

  3. Delete YourTable
  4. Copy YourTable_Temp to YourTable
  5. Check if all looks as expected and if so - get rid of temp and backup tables

Please note: the cost of above #1 is exactly the same as action in first bullet in your question. The rest of actions (copy) are free

In case if you have repeated/record fields - you still can execute above plan, but in #1 you will need to use some BigQuery User-Defined Functions to have proper schema in output
You can see below for examples - of course this will require some extra dev - but if you are in critical situation - this should work for you

Create a table with Record type column
create a table with a column type RECORD

I hope, at some point Google BigQuery Team will add better support for cases like yours when you need to manipulate and output repeated/record data, but for now this is a best workaround I found - at least for myself

like image 133
Mikhail Berlyant Avatar answered Sep 19 '22 16:09

Mikhail Berlyant