Does Apache Spark SQL support MERGE clause that's similar to Oracle's MERGE SQL clause?
MERGE INTO <table> USING (
  SELECT * FROM <table1>
) ON (...)
WHEN MATCHED THEN UPDATE ...
  DELETE WHERE ...
WHEN NOT MATCHED THEN INSERT ...
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine. Unlike the basic Spark RDD API, the interfaces provided by Spark SQL give Spark more information about the structure of both the data and the computation being performed; internally, Spark SQL uses this extra information to perform extra optimizations.
It also integrates with Hive: you can run SQL or HiveQL queries on existing warehouses, since Spark SQL supports the HiveQL syntax as well as existing Hive metastores, SerDes, and UDFs. In short: 1. it executes SQL queries, and 2. it can read data from an existing Hive installation, running unmodified Hive queries up to 100x faster on existing deployments and data.
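As a quick illustration of the Hive integration, here is a minimal sketch (the table name `sales` is just a placeholder) of querying an existing Hive warehouse through Spark SQL:
import org.apache.spark.sql.SparkSession

// Enable Hive support so the session uses the existing Hive metastore,
// SerDes and UDFs, then query an existing warehouse table with SQL/HiveQL.
val spark = SparkSession.builder()
  .appName("SparkSqlHiveExample")
  .enableHiveSupport()
  .getOrCreate()

val df = spark.sql("SELECT * FROM sales LIMIT 10")
df.show()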
Spark does support the MERGE operation, using Delta Lake as the storage format. The first step is to save the table in the delta format, which adds transactional capabilities and support for DELETE/UPDATE/MERGE operations in Spark.
Python/Scala:
df.write.format("delta").save("/data/events")
SQL: CREATE TABLE events (eventId long, ...) USING delta
Once the table exists, you can run your usual SQL Merge command:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (date, eventId, data)
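The statement above assumes a source named `updates` is visible to Spark SQL. A minimal sketch of one way to provide it, assuming the new rows live in a DataFrame called `updatesDF` (as in the Scala example below) and that the Delta SQL extension mentioned further down is enabled:
// Expose the DataFrame of new rows as a temporary view named "updates",
// then run the MERGE through the SQL interface.
updatesDF.createOrReplaceTempView("updates")

spark.sql("""
  MERGE INTO events
  USING updates
  ON events.eventId = updates.eventId
  WHEN MATCHED THEN
    UPDATE SET events.data = updates.data
  WHEN NOT MATCHED THEN
    INSERT (date, eventId, data) VALUES (date, eventId, data)
""")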
The same command is also available through the programmatic API (shown here in Scala; an equivalent API exists for Python):
import io.delta.tables.DeltaTable

DeltaTable.forPath(spark, "/data/events/")
  .as("events")
  .merge(
    updatesDF.as("updates"),
    "events.eventId = updates.eventId")
  .whenMatched
  .updateExpr(
    Map("data" -> "updates.data"))
  .whenNotMatched
  .insertExpr(
    Map(
      "date" -> "updates.date",
      "eventId" -> "updates.eventId",
      "data" -> "updates.data"))
  .execute()
To use the Delta Lake format, you also need the delta package as a dependency in your Spark job:
<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_x.xx</artifactId>
  <version>xxxx</version>
</dependency>
See https://docs.delta.io/latest/delta-update.html#upsert-into-a-table-using-merge for more details
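Depending on your Spark and Delta Lake versions, the SQL form of MERGE also needs Delta's SQL extension enabled on the session. A minimal sketch of that configuration (the application name is just a placeholder):
import org.apache.spark.sql.SparkSession

// Enable Delta Lake's SQL extension and catalog so that MERGE/UPDATE/DELETE
// SQL statements are understood by open-source Delta on Spark 3.x.
val spark = SparkSession.builder()
  .appName("DeltaMergeExample")
  .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
  .config("spark.sql.catalog.spark_catalog",
          "org.apache.spark.sql.delta.catalog.DeltaCatalog")
  .getOrCreate()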
Alternatively, you can write your own custom JDBC code. The code below performs plain inserts; you can edit it to issue a merge/upsert instead of an INSERT. Be aware that this is a computation-heavy operation.
import java.sql.{Connection, DriverManager, PreparedStatement}

df.rdd.coalesce(2).foreachPartition(partition => {
  // brConnect is a broadcast variable holding the JDBC connection properties
  val connectionProperties = brConnect.value
  val jdbcUrl  = connectionProperties.getProperty("jdbcurl")
  val user     = connectionProperties.getProperty("user")
  val password = connectionProperties.getProperty("password")
  val driver   = connectionProperties.getProperty("Driver")

  Class.forName(driver)
  val dbc: Connection = DriverManager.getConnection(jdbcUrl, user, password)
  dbc.setAutoCommit(false)          // commit manually once per batch
  val db_batchsize = 1000

  val sqlString = "INSERT INTO employee (id, fname, lname, userid) VALUES (?, ?, ?, ?)"
  val pstmt: PreparedStatement = dbc.prepareStatement(sqlString)

  partition.grouped(db_batchsize).foreach(batch => {
    batch.foreach { row =>
      pstmt.setLong(1, row.getAs[Long]("id"))
      pstmt.setString(2, row.getAs[String]("fname"))
      pstmt.setString(3, row.getAs[String]("lname"))
      pstmt.setString(4, row.getAs[String]("userid"))
      pstmt.addBatch()              // queue the row instead of executing it immediately
    }
    pstmt.executeBatch()            // send the whole batch in one round trip
    dbc.commit()
  })

  pstmt.close()
  dbc.close()
})
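To turn the inserts above into a merge, you would replace the prepared statement with an upsert. A minimal sketch, assuming a SQL Server target table `employee` keyed on `id` (other databases use different upsert syntax, e.g. MySQL's INSERT ... ON DUPLICATE KEY UPDATE or PostgreSQL's INSERT ... ON CONFLICT):
// Sketch only: a MERGE-style upsert statement that can replace sqlString above;
// the parameters 1-4 are still bound per row exactly as before.
val mergeSql =
  """MERGE INTO employee AS t
    |USING (VALUES (?, ?, ?, ?)) AS s (id, fname, lname, userid)
    |ON t.id = s.id
    |WHEN MATCHED THEN
    |  UPDATE SET t.fname = s.fname, t.lname = s.lname, t.userid = s.userid
    |WHEN NOT MATCHED THEN
    |  INSERT (id, fname, lname, userid) VALUES (s.id, s.fname, s.lname, s.userid);""".stripMargin

val mergeStmt = dbc.prepareStatement(mergeSql)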