
How can I prevent Delta Lake checkpoints from being removed in Azure Databricks?

I noticed that I have only 2 checkpoint files in a Delta Lake folder. Every 10 commits, a new checkpoint is created and the oldest one is removed.

For instance, this morning I had 2 checkpoints: 340 and 350. I was able to time travel from 340 to 359.

Now, after a "write" action, I have 2 checkpoints: 350 and 360. I'm now able to time travel from 350 to 360. What removes the old checkpoints? How can I prevent that?

I'm using Azure Databricks 7.3 LTS ML.

Nastasia asked Sep 14 '25 01:09


2 Answers

The ability to time travel isn't directly tied to checkpoints. A checkpoint is just an optimization: it stores the table's metadata as a Parquet file so that it can be read quickly, without scanning the individual transaction log files. This blog post describes the transaction log in more detail.
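To make the relationship between checkpoints and commit files concrete, here is a minimal sketch in plain Python (no Spark needed) that works only on the file names found in a _delta_log directory. It assumes the classic naming scheme: a 20-digit zero-padded version followed by .json for commits and .checkpoint.parquet (optionally multi-part) for checkpoints. The function names summarize_delta_log and readable_versions are made up for illustration, and the rule used (a version is rebuildable from the newest checkpoint at or before it plus every later commit, or from commit 0 onward) is a simplification of what Delta actually does.

```python
import re

# Assumed naming: <20-digit version>.json for commits and
# <20-digit version>.checkpoint[.<part>.<total>].parquet for checkpoints.
_COMMIT = re.compile(r"^(\d{20})\.json$")
_CHECKPOINT = re.compile(r"^(\d{20})\.checkpoint(?:\.\d+\.\d+)?\.parquet$")


def summarize_delta_log(file_names):
    """Return (commit_versions, checkpoint_versions) as sorted lists of ints."""
    commits, checkpoints = set(), set()
    for name in file_names:
        m = _COMMIT.match(name)
        if m:
            commits.add(int(m.group(1)))
            continue
        m = _CHECKPOINT.match(name)
        if m:
            checkpoints.add(int(m.group(1)))
    return sorted(commits), sorted(checkpoints)


def readable_versions(file_names):
    """Versions that can be rebuilt from the files that are still present."""
    commits, checkpoints = summarize_delta_log(file_names)
    commit_set = set(commits)
    readable = []
    for v in sorted(commit_set | set(checkpoints)):
        usable = [c for c in checkpoints if c <= v]
        # Start from the newest usable checkpoint, else from an empty table (-1).
        start = usable[-1] if usable else -1
        if all(n in commit_set for n in range(start + 1, v + 1)):
            readable.append(v)
    return readable


# Hypothetical listing that mirrors the situation in the question.
names = [f"{v:020d}.json" for v in range(350, 361)]
names += [f"{350:020d}.checkpoint.parquet", f"{360:020d}.checkpoint.parquet"]
print(readable_versions(names))
```

The point of the sketch is that the reachable range depends on which commit files survive, not on how many checkpoints are kept: if the .json files before 350 had still been present along with an older checkpoint, those versions would remain reachable.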

The commit history is retained for 30 days by default, and this can be customized as described in the documentation. Note that VACUUM may remove data files that older commits in the log still reference, because deleted data files are retained for only 7 days by default. Time travel to a version fails once its data files are gone, so it's worth checking both settings.
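As a rough illustration of adjusting both retention windows together, the sketch below builds an ALTER TABLE statement as a string. It assumes the two windows are controlled by the table properties delta.logRetentionDuration and delta.deletedFileRetentionDuration; the helper name retention_sql and its parameters are hypothetical, and the statement still has to be run through spark.sql in a notebook.

```python
def retention_sql(table_path, log_days=30, deleted_file_days=7):
    """Build an ALTER TABLE statement that sets both retention properties.

    The defaults mirror the 30-day and 7-day defaults mentioned above.
    """
    if log_days < 1 or deleted_file_days < 1:
        raise ValueError("retention must be at least 1 day")
    return (
        f"ALTER TABLE delta.`{table_path}` SET TBLPROPERTIES ("
        f"delta.logRetentionDuration = 'interval {log_days} days', "
        f"delta.deletedFileRetentionDuration = 'interval {deleted_file_days} days')"
    )


stmt = retention_sql("/tmp/dtest", log_days=60, deleted_file_days=60)
print(stmt)
# In a Databricks notebook, where `spark` is predefined:
# spark.sql(stmt)
```

Keeping the two values equal is a deliberate choice in this sketch: a long log retention is of little use for time travel if VACUUM has already deleted the data files those older commits point to.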

If you run the following test, you can see that the history covers more than 10 versions:

# Append the same 10-row DataFrame 20 times; each append creates one commit.
df = spark.range(10)
for i in range(20):
    df.write.mode("append").format("delta").save("/tmp/dtest")
    # Uncomment to see the content of the log after each operation:
    # print(dbutils.fs.ls("/tmp/dtest/_delta_log/"))

Then list the files in the log. You should see both checkpoints and the files for individual transactions:

%fs ls /tmp/dtest/_delta_log/

Also check the history. You should have at least 20 versions:

%sql

describe history delta.`/tmp/dtest/`

And you should be able to go back to an early version:

%sql

SELECT * FROM delta.`/tmp/dtest/` VERSION AS OF 1
Alex Ott answered Sep 17 '25 21:09


If you want to keep your checkpoints for X days, you can set delta.checkpointRetentionDuration to X days like this:

spark.sql("""
    ALTER TABLE delta.`path`
    SET TBLPROPERTIES (
        delta.checkpointRetentionDuration = 'X days'
    )
""")
Nastasia answered Sep 17 '25 20:09