
Delta Lake incremental manifest file generation

I am trying to set up Delta Lake on S3 using the open source Delta Lake API. My tables are partitioned by date, and I have to perform merges (a merge may also update old partitions). I generate manifest files so that I can query the Delta table from AWS Athena, but when I run the manifest generation method, Delta Lake creates manifest files for all partitions. Is there a way to generate manifest files incrementally, that is, to create or update files only for the most recently updated partitions, or to specify the partitions for which manifest files should be produced?

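from delta.tables import DeltaTable

# delta_table_path and condition are defined elsewhere
df = spark.read.csv("s3://temp/2020-01-01.csv")
delta_table = DeltaTable.forPath(spark, delta_table_path)

delta_table.alias("source").merge(df.alias("new_data"), condition).whenNotMatchedInsertAll().execute()

# Regenerates manifests for every partition, not only the ones the merge touched
delta_table.generate("symlink_format_manifest")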
priyansh jain asked Jan 29 '26 20:01



1 Answer

I faced the same issue: regenerating manifests for a huge table with many partitions was overkill. I resolved it with the two workarounds below.

  1. The easy option: use Spark to create the Delta table in the Hive metastore with a DDL statement, pointing LOCATION at the S3 folder and setting TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true). Then use Spark to load the data into the same location. From then on, the manifest file for a partition is created or updated as soon as data is appended to or overwritten in it.

spark.sql("CREATE TABLE student (id INT, name STRING, age INT) USING delta PARTITIONED BY (age) LOCATION 's3://path/student' TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true)")

For a new table this works without any extra steps. For a table that already exists, however, this approach is a workaround that requires reloading the data.
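As a possible alternative to reloading, the same property can usually be set on an existing table with ALTER TABLE. The sketch below only builds the statement as a string. It assumes a Delta Lake version with SQL DDL support, and the helper name enable_auto_manifest_sql is made up for illustration. The path reuses the example location from the DDL above.

```python
def enable_auto_manifest_sql(table_path: str) -> str:
    """Build an ALTER TABLE statement that enables automatic manifest
    updates on a path-based Delta table (helper name is hypothetical)."""
    return (
        f"ALTER TABLE delta.`{table_path}` "
        "SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)"
    )


statement = enable_auto_manifest_sql("s3://path/student")
print(statement)

# With a SparkSession configured for Delta Lake, one would then run:
#   spark.sql(statement)
# followed by a single full generation so that existing partitions
# get their manifests:
#   delta_table.generate("symlink_format_manifest")
```

After that, later writes should keep the manifests up to date in the way the answer describes, without the table having to be recreated.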

  2. The other, trickier option I followed: locate the most recent metadata entry in the _delta_log folder (hdfs dfs -cat s3://path/student/_delta_log/*.json | grep 'metaData') and copy it. Add the same TBLPROPERTIES setting in two places: under commitInfo-->operationParameters as "properties":"{\"delta.compatibility.symlinkFormatManifest.enabled\":\"true\"}", and under metaData as "configuration":{"delta.compatibility.symlinkFormatManifest.enabled":"true"}. Save the result as a new .json file named (last sequence number in the _delta_log folder + 1).json and move it into _delta_log. From the next load onwards, manifest files are created automatically.
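The metadata edit in the second workaround can be sketched in plain Python. This is only an illustration of the JSON change and the file naming, not a tested procedure for modifying a live transaction log. The function names and the sample metaData line are hypothetical, and the sketch relies on Delta log files being named with a 20-digit zero-padded version number.

```python
import json

MANIFEST_KEY = "delta.compatibility.symlinkFormatManifest.enabled"


def enable_manifest_in_metadata(action_line: str) -> str:
    """Return a copy of a metaData action line with the manifest
    property switched on in its configuration map."""
    action = json.loads(action_line)
    meta = action["metaData"]
    config = dict(meta.get("configuration") or {})
    config[MANIFEST_KEY] = "true"  # property values are stored as strings
    meta["configuration"] = config
    return json.dumps(action, separators=(",", ":"))


def next_commit_name(last_version: int) -> str:
    """File name of the commit that follows last_version."""
    return f"{last_version + 1:020d}.json"


# Hypothetical, heavily shortened metaData line for illustration only.
sample = '{"metaData":{"id":"hypothetical-id","partitionColumns":["age"],"configuration":{}}}'
patched = enable_manifest_in_metadata(sample)
print(next_commit_name(9))
print(patched)
```

The commitInfo line mentioned in the answer would need the matching "properties" entry as well. Editing _delta_log by hand bypasses Delta Lake's own commit logic, so it is worth trying on a copy of the table first.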
Arvind answered Jan 31 '26 10:01



