I am using Apache Spark DataFrames to join two data sources and get the result as another DataFrame. I want to write the result to a Postgres table. I see this option:
myDataFrame.write.jdbc(url, table, connectionProperties)
But what I want to do is UPSERT the DataFrame into the table based on the table's primary key. How can this be done? I am using Spark 1.6.0.
It is not supported. DataFrameWriter can either append to or overwrite an existing table. If your application requires more complex logic, you'll have to deal with this manually.
One option is to use an action (foreach, foreachPartition) with a standard JDBC connection. Another is to write to a temporary table and handle the rest directly in the database (see the sketch below).
See also SPARK-19335 (Spark should support doing an efficient DataFrame Upsert via JDBC) and related proposals.
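A minimal sketch of the temporary-table approach, assuming Postgres 9.5+ (for INSERT ... ON CONFLICT) and hypothetical names: a staging_table, a target_table with primary key id, and a value column:

import java.sql.DriverManager

// 1. Dump the DataFrame into a staging table (dropped and recreated on each run).
myDataFrame.write
  .mode("overwrite")
  .jdbc(url, "staging_table", connectionProperties)

// 2. Merge the staging table into the target inside the database.
val conn = DriverManager.getConnection(url, connectionProperties)
try {
  conn.createStatement().executeUpdate(
    """INSERT INTO target_table
      |SELECT * FROM staging_table
      |ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value""".stripMargin)
} finally {
  conn.close()
}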
KrisP has the right of it. The best way to do an upsert is not through a prepared statement. It's important to note that this method will insert one row at a time, with as many partitions as the number of workers you have. If you want to do this in batches, you can as well:
import java.sql.{Connection, DriverManager, PreparedStatement}

// numberOfWorkers, jdbcUrl, upsertSql and batchSize are placeholders to fill in
dataframe.coalesce(numberOfWorkers).foreachPartition { partition =>
  // one connection and prepared statement per partition
  val dbc: Connection = DriverManager.getConnection(jdbcUrl)
  val st: PreparedStatement = dbc.prepareStatement(upsertSql)
  try {
    // send the rows to the database in groups of batchSize
    partition.grouped(batchSize).foreach { batch =>
      batch.foreach { row =>
        st.setDouble(1, row.getDouble(1)) // bind your columns here
        st.addBatch()
      }
      st.executeBatch()
    }
  } finally {
    st.close()
    dbc.close()
  }
}
This will execute batches for each worker and close the DB connection. It gives you control over the number of workers and the batch size, and lets you work within those constraints.
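As for what the prepared statement itself could be: assuming Postgres 9.5+ and a hypothetical table my_table with primary key id and a value column, an upsert statement might look like this (older Postgres versions would need a writable CTE or a stored function instead):

val upsertSql =
  """INSERT INTO my_table (id, value)
    |VALUES (?, ?)
    |ON CONFLICT (id) DO UPDATE SET value = EXCLUDED.value""".stripMargin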