
Get the difference between two versions of a Delta Lake table

How can I find the difference between the last two versions of a Delta table? Here is as far as I got using DataFrames:

val df1 = spark.read
  .format("delta")
  .option("versionAsOf", "0001")
  .load("/path/to/my/table")

val df2 = spark.read
  .format("delta")
  .option("versionAsOf", "0002")
  .load("/path/to/my/table")

// Non-idiomatic way to do it: the symmetric difference of the two versions,
// i.e. the rows that appear in exactly one of them.
df1.union(df2).except(df1.intersect(df2))

There is a commercial version of Delta by Databricks that provides a solution called CDF (Change Data Feed), but I'm looking for an open-source alternative.

Ismail H asked Nov 07 '22 00:11


1 Answer

This returns a DataFrame containing the comparison, using the diff feature of G-Research's open-source spark-extension library:

import uk.co.gresearch.spark.diff.DatasetDiff

df1.diff(df2)
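
If adding a dependency is not an option, a rough equivalent can be sketched with plain Spark by running `except` in both directions and tagging each side. This is only a minimal sketch, not what the library does internally: the `rowDiff` helper, the `change` column and the sample rows are made up for illustration, and `df1` / `df2` stand in for the two versions loaded with `versionAsOf` in the question. Note that `except` has set semantics, so duplicate rows are collapsed, and a modified row shows up as one deleted row plus one inserted row.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object DeltaVersionDiff {
  // Hypothetical helper: tags rows only in `newer` with "I" (inserted)
  // and rows only in `older` with "D" (deleted).
  def rowDiff(older: DataFrame, newer: DataFrame): DataFrame = {
    val inserted = newer.except(older).withColumn("change", lit("I"))
    val deleted  = older.except(newer).withColumn("change", lit("D"))
    inserted.union(deleted)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("delta-version-diff")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data standing in for the two table versions.
    val df1 = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val df2 = Seq((1, "a"), (2, "B"), (4, "d")).toDF("id", "value")

    // Unchanged rows (id 1 here) do not appear in the result.
    rowDiff(df1, df2).orderBy("id", "change").show()

    spark.stop()
  }
}
```

Unlike the symmetric difference in the question, this keeps track of which version each row came from, which is usually what you want when inspecting what changed between two versions.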

fidelin answered Nov 26 '22 20:11