
How to overwrite data with PySpark's JDBC without losing schema?

I have a DataFrame that I want to write to a PostgreSQL database. If I simply use "overwrite" mode, like:

df.write.jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)

The table is recreated and the data is saved. But the problem is that I'd like to keep the PRIMARY KEY and indexes on the table. So I'd like either to overwrite only the data, keeping the table schema, or to add the primary key constraint and indexes afterward. Can either be done with PySpark? Or do I need to connect to PostgreSQL and execute the commands to add the indexes myself?
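For reference, executing those commands manually would look something like this (a minimal sketch using psycopg2; the table name, column names, and connection string are all placeholders):

import psycopg2

# Placeholder connection string and names, for illustration only.
conn = psycopg2.connect("dbname=mydb user=myuser password=mypass host=localhost")
with conn, conn.cursor() as cur:
    # Re-add the constraint and index after Spark has recreated the table.
    cur.execute("ALTER TABLE my_table ADD PRIMARY KEY (id);")
    cur.execute("CREATE INDEX my_table_value_idx ON my_table (value);")
conn.close()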

asked Apr 04 '18 by Fernando Camargo

1 Answer

The default behavior of mode="overwrite" is to drop the table and recreate it with the new data. You can instead have Spark truncate the existing table by adding option("truncate", "true"), and then insert the new rows:

df.write.option("truncate", "true").jdbc(url=DATABASE_URL, table=DATABASE_TABLE, mode="overwrite", properties=DATABASE_PROPERTIES)

This way the table is not recreated, so your schema, including the primary key and indexes, is left untouched.
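For completeness, the same call with the connection details spelled out might look like this (a minimal sketch; the URL, credentials, driver, and table name are placeholders). Note that the truncate option only takes effect when the target table already exists; on a first run Spark still creates it:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("overwrite-example").getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

# Placeholder connection details, for illustration only.
DATABASE_URL = "jdbc:postgresql://localhost:5432/mydb"
DATABASE_TABLE = "my_table"
DATABASE_PROPERTIES = {
    "user": "myuser",
    "password": "mypass",
    "driver": "org.postgresql.Driver",
}

# "truncate" makes Spark issue TRUNCATE TABLE instead of DROP/CREATE,
# so the primary key and indexes survive the overwrite.
df.write.option("truncate", "true").jdbc(
    url=DATABASE_URL,
    table=DATABASE_TABLE,
    mode="overwrite",
    properties=DATABASE_PROPERTIES,
)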

answered Sep 17 '22 by user3124181