Can you run a transactional data lake (Hudi, Delta Lake) with multiple EMR clusters

I’m looking into several “transactional data lake” technologies such as Apache Hudi, Delta Lake, and AWS Lake Formation Governed Tables.

Except for the latter, I can’t see how these would work in a multi-cluster environment. I’m baselining against S3 for storage, and I want to incrementally alter my data lake, where many clusters may be reading from and writing to the lake at any given time. Is this possible/supported? It seems like the compaction and transaction processes run on-cluster, so you cannot manage a transactional data lake with these platforms from multiple disparate sources. Or am I mistaken?

Any anecdotes or performance limitations you’ve found would be appreciated!

asked Nov 20 '25 by zachd1_618

1 Answer

You can enable multi-writer support in Apache Hudi and then use a lock provider, as described here: https://hudi.apache.org/docs/concurrency_control#enabling-multi-writing

Example using an AWS DynamoDB lock provider (set the last three properties to your DynamoDB table name, partition key, and AWS region):

hoodie.write.lock.provider=org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider
hoodie.write.lock.dynamodb.table
hoodie.write.lock.dynamodb.partition_key
hoodie.write.lock.dynamodb.region
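
For concreteness, here is a minimal PySpark sketch of what one writer among several concurrent EMR clusters could look like. The table name, record key, precombine field, S3 path, and DynamoDB lock table are placeholders, and it assumes the Hudi Spark bundle (with the hudi-aws module for the DynamoDB lock provider) is on each cluster's classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-multi-writer").getOrCreate()
df = spark.createDataFrame([(1, "a", 1700000000)], ["id", "data", "ts"])  # toy data

hudi_options = {
    "hoodie.table.name": "my_table",                    # placeholder
    "hoodie.datasource.write.recordkey.field": "id",    # placeholder
    "hoodie.datasource.write.precombine.field": "ts",   # placeholder
    "hoodie.datasource.write.operation": "upsert",
    # Multi-writer settings from the Hudi concurrency docs linked above:
    # optimistic concurrency control plus lazy cleaning of failed writes,
    # so one writer's failure doesn't roll back another writer's files.
    "hoodie.write.concurrency.mode": "optimistic_concurrency_control",
    "hoodie.cleaner.policy.failed.writes": "LAZY",
    # The DynamoDB-based lock provider from the config above; every
    # concurrent writer must point at the same lock table.
    "hoodie.write.lock.provider":
        "org.apache.hudi.aws.transaction.lock.DynamoDBBasedLockProvider",
    "hoodie.write.lock.dynamodb.table": "hudi-locks",      # placeholder
    "hoodie.write.lock.dynamodb.partition_key": "my_table",
    "hoodie.write.lock.dynamodb.region": "us-east-1",      # placeholder
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://my-bucket/my_table"))  # placeholder path

Run the same lock configuration from each EMR cluster; Hudi acquires the DynamoDB lock around each commit, so concurrent writes to the same table are serialized instead of clobbering each other's timeline.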

Delta Lake’s documentation carries a warning that multiple writers may result in data loss: https://docs.delta.io/latest/delta-storage.html#amazon-s3

Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss.

You may also find this blog post interesting; it discusses common pitfalls in Lakehouse concurrency control.

answered Nov 23 '25 by Kyle Weller


