I’m looking into several “transactional data lake” technologies such as Apache Hudi, Delta Lake, and AWS Lake Formation Governed Tables.
Except for the latter, I can’t see how these would work in a multi-cluster environment. I’m baselining against S3 for storage, and I want to incrementally alter my data lake, where I may have many clusters all reading from and writing to the lake at any given time. Is this possible/supported? It seems like the compaction and transaction processes run on-cluster, so you cannot manage a transactional data lake with these platforms from multiple disparate sources. Or am I mistaken?
Any anecdotes or performance limitations you’ve found would be appreciated!
You can enable a config for multiple writers on Apache Hudi and then use a lock provider as described here: https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing
Example configuration keys for the AWS DynamoDB-based lock provider (the last three must be set to your DynamoDB table name, partition key, and region):
hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table
hoodie.write.lock.dynamodb.partition_key
hoodie.write.lock.dynamodb.region
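To make that concrete, here is a minimal PySpark sketch of a multi-writer-safe Hudi write, assuming the Hudi Spark bundle and the hudi-aws module are on the classpath; the table name, S3 path, DynamoDB table, partition key, and region are hypothetical, and the concurrency settings follow the docs page linked above:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-multi-writer").getOrCreate()

# Toy data; the record key and precombine fields are hypothetical.
df = spark.createDataFrame([(1, "a", "2024-01-01")], ["id", "value", "ts"])

hudi_options = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    # Multi-writer settings from the Hudi concurrency-control docs:
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # DynamoDB-based lock provider (table/key/region values are hypothetical):
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",
    "hoodie.write.lock.dynamodb.partition_key": "events",
    "hoodie.write.lock.dynamodb.region": "us-east-1",
}

(df.write
   .format("hudi")
   .options(**hudi_options)
   .mode("append")
   .save("s3a://my-bucket/hudi/events"))  # hypothetical bucket/path
```

Every cluster or job writing to the same table would point at the same lock table, so commits from disparate writers are serialized through optimistic concurrency control.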
Delta Lake has a warning in the documentation that multiple writers may result in data loss: https://docs.delta.io/latest/delta-storage.html#amazon-s3
Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss.
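To show the scope of that warning, here is a minimal sketch (bucket and path are hypothetical) of the single-driver write pattern that Delta's S3 support assumes; routing all writes to a given table through one Spark driver like this is what avoids the data-loss scenario the docs describe:

```python
from pyspark.sql import SparkSession

# A single Spark driver writing to a Delta table on S3. Delta's S3 guarantees
# assume that all concurrent writes to this table go through one driver.
spark = (
    SparkSession.builder
    .appName("delta-s3-single-writer")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(df.write
   .format("delta")
   .mode("append")
   .save("s3a://my-bucket/tables/events"))  # hypothetical bucket/path
```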
There is also a blog post discussing common pitfalls in Lakehouse concurrency control that you may find interesting.