If I want to use Delta time travel to compare two versions and get the changes, similar to CDC, how do I do that?
I can see two options:
In SQL you have EXCEPT/MINUS queries, where you compare all the data of one table against another. I assume you could use that here as well, correct? But is that fast enough once the versions you compare keep getting bigger and bigger and you always need to compare every row against every row of the latest version?
Does Delta keep some kind of hash per row so it can do that very fast, or is that very time-consuming for Delta?
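For concreteness, the EXCEPT idea would look something like this (the path is a placeholder; the VERSION AS OF SQL syntax needs a reasonably recent Delta Lake release, otherwise the DataFrame reader's versionAsOf option shown further down does the same):

// All rows in version 1 that are not in version 0.
val changes = spark.sql("""
  SELECT * FROM delta.`/tmp/delta/t2` VERSION AS OF 1
  EXCEPT
  SELECT * FROM delta.`/tmp/delta/t2` VERSION AS OF 0
""")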
Found on Slack:
These operations require updating the existing rows to mark the previous values of the keys as old, and then inserting the new rows as the latest values. Given a source table with the updates and a target table with the dimensional data, SCD Type 2 can be expressed with MERGE.
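A rough sketch of that MERGE shape, not a definitive implementation: the table path and the column names (customerId, address, current, effectiveDate, endDate) are assumptions, updates is an assumed DataFrame of incoming source rows, and the full SCD Type 2 recipe in the Delta docs additionally stages a union of the updates so one MERGE can both close the old row and insert its new version.

import io.delta.tables._

val target = DeltaTable.forPath(spark, "/tmp/delta/customers")

target.as("t")
  .merge(updates.as("u"), "t.customerId = u.customerId AND t.current = true")
  .whenMatched("t.address <> u.address")
  .updateExpr(Map("current" -> "false", "endDate" -> "u.effectiveDate")) // close the old row
  .whenNotMatched()
  .insertExpr(Map(
    "customerId"    -> "u.customerId",
    "address"       -> "u.address",
    "current"       -> "true",
    "effectiveDate" -> "u.effectiveDate",
    "endDate"       -> "null")) // brand-new key: insert as the latest value
  .execute()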
Update a table. You can update data that matches a predicate in a Delta table. For example, to fix a spelling mistake in the eventType column, you can run the following Scala:
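A sketch along the lines of the docs' example (the table path and the misspelled value 'clck' are assumptions):

import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, "/tmp/delta/events")

deltaTable.updateExpr(
  "eventType = 'clck'",           // predicate selecting the bad rows
  Map("eventType" -> "'click'"))  // the new value is an expression, hence the inner quotes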
Let df1 and df2 be two DataFrames, where df1 has columns (A, B, C) and df2 has columns (D, C, B). You can then create a new DataFrame df3 that is the intersection of df1 and df2 conditioned on columns B and C: df3 will contain only those rows from df1 and df2 where that condition is satisfied.
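For example (column names and sample values made up), an inner join on the shared key columns gives exactly that intersection:

import spark.implicits._

val df1 = Seq((1, "x", 10), (2, "y", 20)).toDF("A", "B", "C")
val df2 = Seq(("d1", 10, "x"), ("d2", 99, "z")).toDF("D", "C", "B")

// Keep only the rows whose (B, C) values appear in both DataFrames.
val df3 = df1.join(df2, Seq("B", "C"))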
You can compute the difference of two versions of the table, but as you guessed, it's expensive to do. It's also tricky to compute the actual difference when the Delta table has changes other than appends.
Usually when people ask about this, they're trying to design their own system that gives them exactly-once processing of data from Delta to somewhere else; Spark Structured Streaming with the Delta source already exists to do this.
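A minimal sketch of that existing path (the table, output, and checkpoint paths are placeholders):

// Stream each committed change of the Delta table downstream;
// the checkpoint is what gives you exactly-once processing.
val stream = spark.readStream
  .format("delta")
  .load("/tmp/delta/t2")

stream.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/delta/_checkpoints/t2-downstream")
  .start("/tmp/delta/t2-downstream")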
If you do want to write your own, you can read the transaction log directly (the protocol spec is at https://github.com/delta-io/delta/blob/master/PROTOCOL.md) and use the actions in the versions between the two you're comparing to figure out which files have changes to read.
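As a taste of what that means (the path is a placeholder): each commit is a JSON file of actions under _delta_log, and its add/remove actions name the data files that changed in that version.

// Read the actions committed as version 1 of the table.
val commit = spark.read.json("/tmp/delta/t2/_delta_log/00000000000000000001.json")

// The add/remove columns exist only if the commit contains those action types.
commit.select("add.path", "remove.path").show(truncate = false)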
Please note that versions of a Delta table are cached (persisted by Spark), so comparing the different versions should be fairly cheap.
val v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/t2")
val v1 = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta/t2")
// v0 and v1 are persisted - see Storage tab in web UI
Getting those v0 and v1 isn’t expensive; comparing the two can be both expensive and tricky. If the table is append-only then it’s (v1 - v0); if it’s got upserts then you have to handle (v0 - v1) as well, and if it’s got metadata or protocol changes it gets even trickier.
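In DataFrame terms, reusing v0 and v1 from above (exceptAll rather than except so duplicate rows are counted correctly):

// Append-only table: the change set is just what is new in v1.
val added = v1.exceptAll(v0)

// With upserts you also need the rows that disappeared since v0.
val removed = v0.exceptAll(v1)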
And when you do all that logic yourself it’s suspiciously similar to re-implementing DeltaSource.
You may then consider the following:
import org.apache.spark.sql.delta.DeltaLog

// DeltaLog is an internal API, so it can change between Delta releases.
val log = DeltaLog.forTable(spark, "/tmp/delta/t2")
val v0 = log.getSnapshotAt(0)
val actionsAtV0 = v0.state // the actions that make up version 0
val v1 = log.getSnapshotAt(1)
val actionsAtV1 = v1.state // the actions that make up version 1
actionsAtV0 and actionsAtV1 are all the actions that brought the Delta table to versions 0 and 1, respectively, and can be considered a CDC of the Delta table.
That's basically reading the transaction log, except using some of Delta's internal APIs to make it easier.