I use Apache Camel to monitor a directory, shared by numerous nodes, for new files.
The application requires that processing start only when two different types of files show up in the monitored directory: fileA and fileB.
How can I guarantee in Apache Camel that if node1 picks up fileA, then node2 (or any other node) will not pick up fileB?
Camel has some (experimental) clustering capabilities; see the Camel clustering documentation.
In your particular case, you could model a route that takes leadership when it starts monitoring the directory, thereby preventing other nodes from picking up (the same or other) files.
If your goal is to process incoming files in parallel by balancing them across a certain (most likely dynamic) number of nodes, I'd recommend redesigning the pipeline so that the nodes don't compete for new files.
My best advice is to decouple it: clients generate new files and upload them to a staging folder, then a background daemon process (e.g. a cron-scheduled bash script) checks whether both fileA and fileB have been uploaded, bundles them into a zip, and moves the resulting archive to another folder, which is the one monitored by the processing nodes. That way each archive is a single unit of work, and the nodes are spared the puzzling job of fetching files exclusively in groups.
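The bundling step could look roughly like the sketch below. I mention a bash script above; this is the same idea in plain Java (standard library only), so it is a swapped-in illustration rather than the one true way. The class name `PairBundler`, the method `bundleIfComplete`, the literal file names `fileA`/`fileB`, and the `.part` suffix are all assumptions of mine. It also assumes the staging and ready folders live on the same file system, since the atomic move may fail otherwise.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

public class PairBundler {

    /**
     * Bundles fileA and fileB from the staging folder into one zip in the
     * ready folder. Returns the archive path, or null if the pair is incomplete.
     * File names and folder layout are hypothetical.
     */
    public static Path bundleIfComplete(Path staging, Path ready, String archiveName)
            throws IOException {
        Path a = staging.resolve("fileA");
        Path b = staging.resolve("fileB");
        if (!Files.isRegularFile(a) || !Files.isRegularFile(b)) {
            return null; // pair not complete yet; try again on the next run
        }
        // Write the archive inside staging first, so that the processing
        // nodes watching the ready folder never see a half-written zip.
        Path tmp = staging.resolve(archiveName + ".part");
        try (OutputStream out = Files.newOutputStream(tmp);
             ZipOutputStream zip = new ZipOutputStream(out)) {
            for (Path p : new Path[] {a, b}) {
                zip.putNextEntry(new ZipEntry(p.getFileName().toString()));
                Files.copy(p, zip);
                zip.closeEntry();
            }
        }
        // Publish the finished archive in one step (assumes same file system).
        Path target = ready.resolve(archiveName);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        // Remove the originals only after the archive has been published.
        Files.delete(a);
        Files.delete(b);
        return target;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: PairBundler <stagingDir> <readyDir>");
            return;
        }
        String name = "pair-" + System.currentTimeMillis() + ".zip";
        Path result = bundleIfComplete(Paths.get(args[0]), Paths.get(args[1]), name);
        System.out.println(result == null ? "pair incomplete" : "bundled " + result);
    }
}
```

Run it from cron (or any scheduler) on a single machine. Because only this one process ever touches the staging folder, no locking is needed there, and each processing node simply consumes whole archives.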
But if you can't change anything on the file server, the only solution I have in mind is to put a shared lock around the folder-monitoring operation. This can be implemented with LOCK TABLE in a shared database, or with a distributed lock in a data grid system like Hazelcast or in Redis (both document a distributed lock).
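To illustrate the locking idea, here is a minimal sketch in plain Java. A node scans the folder only while holding the shared lock and, if both files are present, moves them together into its own private folder. Everything named here (`LockedPairClaimer`, `claimPair`, the `claimed` folder) is hypothetical. The `ReentrantLock` in `main` is only a local stand-in that makes the example runnable; it does not coordinate separate machines. In a real cluster you would pass in a distributed implementation of `java.util.concurrent.locks.Lock` instead (to my knowledge Hazelcast's `FencedLock` and, for Redis, Redisson's `RLock` both extend that interface, but check the version you use).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

public class LockedPairClaimer {

    /**
     * Tries to claim fileA and fileB as a pair. Returns true only if this
     * caller moved both files into its private "claimed" folder.
     */
    public static boolean claimPair(Lock sharedLock, Path inbox, Path claimed)
            throws IOException {
        if (!sharedLock.tryLock()) {
            return false; // another node is scanning the folder right now
        }
        try {
            Path a = inbox.resolve("fileA");
            Path b = inbox.resolve("fileB");
            if (!Files.isRegularFile(a) || !Files.isRegularFile(b)) {
                return false; // pair incomplete; leave everything in place
            }
            // Both moves happen under the lock, so no other node can
            // take one half of the pair in between.
            Files.createDirectories(claimed);
            Files.move(a, claimed.resolve("fileA"));
            Files.move(b, claimed.resolve("fileB"));
            return true;
        } finally {
            sharedLock.unlock();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.out.println("usage: LockedPairClaimer <inboxDir> <claimedDir>");
            return;
        }
        // Stand-in lock: replace with a distributed Lock in a real cluster.
        Lock lock = new ReentrantLock();
        boolean got = claimPair(lock, Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(got ? "claimed pair" : "nothing claimed");
    }
}
```

Each node would call `claimPair` on its polling interval and start processing only when it returns true.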