Sharing a resource (file) across different python processes using HDFS

I have some code that tries to find a resource on HDFS. If it is not there, the code calculates the contents of that file and then writes it, so that the next time the resource is needed the reader can just look at the file. This is to prevent expensive recalculation of certain functions.

However, I have several processes running at the same time on different machines on the same cluster. I SUSPECT that they are trying to access the same resource and that I'm hitting a race condition, which leads to a lot of errors where I either can't open a file or a file exists but can't be read.

Hopefully this timeline demonstrates what I believe my issue to be:

  1. Process A goes to access resource X
  2. Process A finds resource X does not exist and begins writing it
  3. Process B goes to access resource X
  4. Process A finishes writing resource X ...and so on

Obviously I would want Process B to wait for Process A to be done with Resource X and simply read it when A is done.

Something like semaphores comes to mind, but I don't know how to use them across different Python processes on separate machines looking at the same HDFS location. Any help would be greatly appreciated.

UPDATE: To be clear, process A and process B will end up calculating the exact same output (i.e. the same filename, with the same contents, at the same location). Ideally, B shouldn't have to calculate it. B would wait for A to calculate it, then read the output once A is done. Essentially this whole process works like a "long term cache" using HDFS, where a given function has an output signature. Any process that wants the output of a function first determines the output signature (this is basically a hash of some function parameters, inputs, etc.). It then checks HDFS to see if the output is there. If it's not, it calculates it and writes it to HDFS so that other processes can also read it.
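For concreteness, the "output signature" idea might look roughly like the sketch below. The names `output_signature` and `cache_path`, the `/cache` root, and the `.bin` suffix are all hypothetical, and the sketch assumes the parameters are JSON-serializable; it only shows how a stable key could be derived, not how the real code does it.

```python
import hashlib
import json


def output_signature(func_name, params):
    """Return a stable hex digest for a function name plus its parameters."""
    # sort_keys=True makes the digest independent of dict insertion order.
    payload = json.dumps({"func": func_name, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_path(func_name, params, root="/cache"):
    """Build the (hypothetical) HDFS path where this result would be stored."""
    return f"{root}/{func_name}/{output_signature(func_name, params)}.bin"
```

Two processes that call the same function with the same parameters end up with the same path, which is exactly why they collide on the same file.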

sedavidw asked Oct 30 '22 22:10


1 Answer

(Setting aside that it sounds like HDFS might not be the right solution for your use case, I'll assume you can't switch to something else. If you can, take a look at Redis, or memcached.)

It seems like this is the kind of thing where you should have a single service that's responsible for computing/caching these results. That way all your processes will have to do is request that the resource be created if it's not already. If it's not already computed, the service will compute it; once it's been computed (or if it already was), either a signal saying the resource is available, or even just the resource itself, is returned to your process.
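A minimal sketch of what the core of such a service might look like is below. It is an assumption-heavy illustration: `ResultService` is a hypothetical name, the in-memory dict stands in for whatever storage you actually use, and a real deployment would put this behind some RPC or HTTP layer so that processes on other machines can call it. The point is only the per-key lock, which makes later requesters wait for the first one instead of recomputing.

```python
import threading


class ResultService:
    """Compute each keyed result at most once; later callers wait, then reuse it."""

    def __init__(self):
        self._results = {}              # key -> computed value (stand-in for storage)
        self._locks = {}                # key -> lock guarding that key's computation
        self._guard = threading.Lock()  # protects the _locks dict itself

    def get(self, key, compute):
        # Fetch or create the lock for this key under the global guard.
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        # Only one caller per key runs compute(); the others block here.
        with lock:
            if key not in self._results:
                self._results[key] = compute()
            return self._results[key]
```

Requests for different keys do not block each other, because each key has its own lock.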

If for some reason you can't do that, you could try using HDFS for synchronization. For example, you could try creating the resource with a sentinel value inside which signals that process A is currently building this file. Meanwhile process A could be computing the value and writing it to a temporary resource; once it's finished, it could just move the temporary resource over the sentinel resource. It's clunky and hackish, and you should try to avoid it, but it's an option.
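Here is a rough sketch of the sentinel idea, with two loud caveats. First, it uses the local filesystem (`os.open` with `O_EXCL`, `os.replace`) as a stand-in for HDFS; with a real HDFS client you would need an equivalent "create, but fail if it already exists" call and a rename. Second, it is a variation on what I described: it uses a separate sentinel file next to the resource rather than a sentinel value inside the resource itself. The `get_or_compute` name and the `.building` suffix are made up for the example.

```python
import os
import time


def get_or_compute(path, compute, poll_seconds=0.5, timeout=600.0):
    """Return the bytes at `path`, letting only one process build them at a time."""
    sentinel = path + ".building"  # hypothetical sentinel name
    deadline = time.monotonic() + timeout
    while True:
        # Fast path: the finished resource is already there.
        try:
            with open(path, "rb") as f:
                return f.read()
        except FileNotFoundError:
            pass
        try:
            # O_EXCL makes creation fail if the sentinel exists, so one process wins.
            fd = os.open(sentinel, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            # Someone else is building; wait and look again.
            if time.monotonic() > deadline:
                raise TimeoutError(f"gave up waiting for {path}")
            time.sleep(poll_seconds)
            continue
        os.close(fd)
        try:
            # Re-check: a builder may have finished between our read and our claim.
            if not os.path.exists(path):
                tmp = f"{path}.tmp.{os.getpid()}"
                with open(tmp, "wb") as f:
                    f.write(compute())
                os.replace(tmp, path)  # readers never see a half-written file
        finally:
            os.remove(sentinel)
        # Loop back to the top and read the finished resource.
```

Note that if the builder crashes hard, the sentinel is left behind and waiters only recover via the timeout, which is part of why this approach is clunky.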

You say you want to avoid expensive recalculations, but if process B is waiting for process A to compute the resource, why can't process B (and C and D) be computing it as well for itself/themselves? If this is okay with you, then in the event that a resource doesn't already exist, you could just have each process start computing and writing to a temporary file, then move the file to the resource location. Hopefully moves are atomic, so one of them will cleanly win; it doesn't matter which if they're all identical. Once it's there, it'll be available in the future. This does involve the possibility of multiple processes sending the same data to the HDFS cluster at the same time, so it's not the most efficient, but how bad it is depends on your use case. You can lessen the inefficiency by, for example, checking after computation and before upload to the HDFS whether someone else has created the resource since you last looked; if so, there's no need to even create the temporary resource.
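A sketch of that "everyone computes, first move wins" approach is below, again using the local filesystem as a stand-in for HDFS. `publish_if_absent` is a hypothetical name. On a POSIX filesystem `os.replace` atomically overwrites the destination; as far as I know, an HDFS rename will by default refuse to overwrite an existing destination instead, so with a real HDFS client the losing process would see the rename fail and should just delete its temporary file.

```python
import os
import uuid


def publish_if_absent(path, compute):
    """Return the resource bytes, computing and publishing them if missing."""
    # Someone may already have published the result.
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()

    data = compute()

    # Check again after the expensive step; skip the upload if we lost the race.
    if not os.path.exists(path):
        # Unique temp name so concurrent writers never share a file.
        tmp = f"{path}.tmp.{uuid.uuid4().hex}"
        with open(tmp, "wb") as f:
            f.write(data)
        # Atomic move: readers see either no file or a complete one.
        os.replace(tmp, path)
    return data
```

Since every process produces identical contents, it doesn't matter whose move lands; the second `os.path.exists` check just trims the wasted uploads.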

TLDR: You can do it with just HDFS, but it would be better to have a service that manages it for you, and it would probably be even better not to use HDFS for this (though you still would possibly want a service to handle it for you, even if you're using Redis or memcached; it depends, once again, on your particular use case).

Cyphase answered Nov 15 '22 04:11