 

How can I return the result of a mapreduce operation to an AWS API request

I have a program that performs several thousand Monte Carlo simulations to predict a result; I can't say what they really predict, so I'm going to use another example from "The Indisputable Existence of Santa Claus", since the content of those algorithms is not relevant to the question. I want to know how often each square on a Monopoly board is visited (to predict which properties are the best to buy). To do this, I simulate thousands of games and collate the results. My current implementation is a stand-alone C# application, but I want to move it to the cloud so that I can provide this as a service - each user can get personalised results by submitting the number of sides that each of their dice have.
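To make the map/reduce shape concrete, here is a minimal sketch of that kind of workload in Python (not the asker's C# code): each game is an independent "map" task producing per-square visit counts, and the "reduce" step just merges the tallies. The rules here are deliberately simplified (no jail, no Chance cards) - it only illustrates the structure.

```python
import random
from collections import Counter

BOARD_SIZE = 40  # a standard Monopoly board has 40 squares

def simulate_game(rolls, die_sides=6, dice=2, rng=None):
    """One independent simulation: roll, move, and count square visits.

    This ignores jail, Chance cards, etc. -- it is only a sketch of the
    visit-counting idea, not a faithful Monopoly model.
    """
    rng = rng or random.Random()
    visits = Counter()
    position = 0
    for _ in range(rolls):
        position = (position + sum(rng.randint(1, die_sides)
                                   for _ in range(dice))) % BOARD_SIZE
        visits[position] += 1
    return visits

def collate(partials):
    """The reduce step: merge per-game visit counts into one tally."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

# Eight independent games, collated -- the full job does the same with
# ~50,000 simulations, which is why it parallelises so cleanly.
tally = collate(simulate_game(1000, rng=random.Random(seed)) for seed in range(8))
```

Because every `simulate_game` call is independent, the batches can run on any number of workers and only `collate` needs to see all the results.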

The current implementation is also quite slow. It is very parallelisable, since each simulation is entirely independent, but I only have 8 cores, so the full prediction of about 50,000 individual simulations takes upwards of 20 minutes on my local machine.

The plan is to have AWS Lambda functions each run one (or several) simulations and then collate - basically mapreduce it. I looked into using AWS EMR (Elastic MapReduce), but that is way too large-scale for what I want: spinning up the instances to run the computations seems to take longer than the whole calculation itself (which would not matter for multi-hour offline analyses, but I want low latency to respond to a web request).

The ideal as I see it would be:

Lambda 0 - fires off many other lambda functions, each doing a small part of the calculation.
Lambdas 1..N - run many simulations in parallel (the number is not a constant).
Lambda N+1 - collates all the results and returns the answer.
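The "Lambda 0" fan-out step could be sketched as below. The worker function name and the payload shape are hypothetical; on AWS the payloads would be sent with the asynchronous `Event` invocation type so Lambda 0 does not wait for the workers.

```python
def partition(total_sims, batch_size):
    """Split the total number of simulations into per-worker batch sizes."""
    batches = [batch_size] * (total_sims // batch_size)
    if total_sims % batch_size:
        batches.append(total_sims % batch_size)
    return batches

def fan_out(total_sims, batch_size, job_id):
    """Lambda 0: build one payload per batch. On AWS each payload would be
    sent fire-and-forget with the Lambda 'Event' invocation type:

        boto3.client("lambda").invoke(
            FunctionName="monopoly-worker",   # hypothetical function name
            InvocationType="Event",
            Payload=json.dumps(payload).encode())
    """
    return [
        {"job_id": job_id, "batch": i, "simulations": n}
        for i, n in enumerate(partition(total_sims, batch_size))
    ]

payloads = fan_out(50000, 500, "job-1")
```

With 500 simulations per worker, the 50,000-simulation job becomes 100 parallel invocations, and N is simply `len(payloads)`.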

There is a lambda mapreduce framework here:

https://github.com/awslabs/lambda-refarch-mapreduce

But it seems to have one major drawback - each time a map stage completes, it writes its results to S3 (I'm fine with using that as temporary storage) and then triggers a new lambda via an event. That triggered lambda checks whether all the results have been written to storage yet; if not, it ends, and if so, it performs the reduction step. That seems like a fair solution, but I'm slightly concerned about a) race hazards when two results come in together - could two reducers both compute the results? - and b) the fact that it fires off a lot of lambdas that all just decide not to run (I know they're cheap to run, but doubling the number to two per simulation - calculate and maybe reduce - will obviously double the costs). Is there a way to fire off an S3 event after, say, 100 files are written to a folder, instead of after every one?
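One common way to address that race (not something the linked framework necessarily does) is an atomic "results remaining" counter - for example a DynamoDB `UpdateItem` with an `ADD -1` update and `ReturnValues` set, so each mapper learns the post-decrement value. Every mapper decrements the counter after writing its result, and only the one that sees it hit zero triggers the reduce, so exactly one reducer runs and no "check and give up" lambdas are needed. A local sketch of that invariant, with a lock standing in for DynamoDB's atomicity:

```python
import threading

class JobCounter:
    """Stand-in for an atomic remote counter (e.g. DynamoDB ADD -1)."""
    def __init__(self, total):
        self._remaining = total
        self._lock = threading.Lock()

    def decrement(self):
        """Atomically decrement; return the new value."""
        with self._lock:
            self._remaining -= 1
            return self._remaining

reduced = []  # records which worker triggered the reduce step

def worker(counter, worker_id):
    # ... this worker would write its partial result to S3 here ...
    if counter.decrement() == 0:
        reduced.append(worker_id)  # exactly one worker ever sees 0

counter = JobCounter(100)
threads = [threading.Thread(target=worker, args=(counter, i)) for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

However the 100 workers interleave, only the last one to decrement sees zero, so the reduction fires exactly once.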

I looked at using step functions, but I'm not sure how to fire many lambdas in parallel in one step and have them all return before the state machine transitions. Step functions would however be useful for the final wrinkle - I want to hide all this behind an API.

From what I've read, an API can invoke a lambda and return that lambda's result, but I don't want the initially invoked lambda to be the one returning the result. That isn't the case when you instead invoke a step function from the API - there, the results of the last state are returned by the API call.

In short, I want:

API request -> Calculate results in parallel -> API response

It is that bit in the middle I'm not clear how to do while still being able to return all the results as a response to the original request - either part on its own is easy.

A few options I can see:

Use a step function, which is natively supported by the AWS API gateway now, and invoke multiple lambdas in one state, waiting for them all to return before transitioning.
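For reference, the Amazon States Language `Parallel` state does exactly that waiting: all branches run concurrently and the state only transitions once every branch has completed. A minimal sketch (the ARNs are placeholders), with the caveat that `Parallel` branches are fixed in the state-machine definition, so a variable number of batches needs either a generous fixed fan-out or the framework approach:

```json
{
  "StartAt": "RunSimulations",
  "States": {
    "RunSimulations": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Batch0",
          "States": {
            "Batch0": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:monopoly-worker",
              "End": true
            }
          }
        },
        {
          "StartAt": "Batch1",
          "States": {
            "Batch1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:us-east-1:123456789012:function:monopoly-worker",
              "End": true
            }
          }
        }
      ],
      "Next": "Collate"
    },
    "Collate": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:monopoly-collate",
      "End": true
    }
  }
}
```

The `Parallel` state's output is an array of the branch outputs, which is exactly what the `Collate` task needs as input.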

Use AWS EMR, but somehow keep the provisioned instances always live to avoid the provisioning time overheads. This obviously negates the scalability of Lambda and is more expensive.

Use the mapreduce framework, or something similar, and find a way to respond to an incoming request from a different lambda to the one that was initially invoked by the API request. Ideally also reduce the number of S3 events involved here, but that's not a priority.

Respond instantly to the original API request from the first lambda, then push more data to the user later when the calculations finish (they should only take about 30 seconds with the parallelism, and the domain is such that that is an acceptable time to wait for a response, even an HTTP response).

I doubt it will make any difference to the solution, since it is just an expansion of the middle bit, not a fundamental change, but the real calculation is iterative, so would be:

Request -> Mapreduce -> Mapreduce -> ... -> Response

As long as I know how to chain one set of lambda functions within a request, chaining more should be just more of the same (I hope).

Thank you.

P.S. Neither the tag aws-emr nor aws-elastic-mapreduce exists yet, and I can't create them.

Y_Less asked Jul 27 '17 19:07


1 Answer

One idea would be to call a Lambda function (call it 'workflow director') via API GW, then write code in that function to call step functions (or whatever) directly and poll the state so you can eventually respond synchronously to the HTTP request.
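That polling loop in the 'workflow director' could be sketched as below. The Step Functions client is passed in, so the loop itself is plain Python; `describe_execution` is a real Step Functions API that returns the execution status and, once finished, its JSON output.

```python
import json
import time

def wait_for_execution(sfn_client, execution_arn, poll_every=1.0, timeout=25.0):
    """Block until the execution leaves RUNNING, then return its output.

    Keep `timeout` safely under API Gateway's 29-second cap so the
    director can still return an error response in time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        desc = sfn_client.describe_execution(executionArn=execution_arn)
        if desc["status"] != "RUNNING":
            if desc["status"] != "SUCCEEDED":
                raise RuntimeError(f"execution ended in {desc['status']}")
            return json.loads(desc["output"])
        time.sleep(poll_every)
    raise TimeoutError("workflow did not finish before the HTTP deadline")
```

In the real handler `sfn_client` would be `boto3.client("stepfunctions")`; any stub with the same `describe_execution` shape is enough to exercise the loop.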

That's just a sync wrapper around the async workflow. Keep in mind that API GW has a hard timeout at 29 seconds, so if you expect that this workflow will take around 30 seconds, it might not be worth it to implement a sync version.

The async model (I guess in this case calling step function directly from API GW) would work in either case.

Edit: sorry, may have misunderstood your comment about step functions. I thought there was no synchronous way to call the step functions workflow and await the final state, but from your comment it seems that there already is.

Let me quickly answer a couple of your specific questions:

Is there a way to fire off an S3 result after, say, 100 files are written to a folder instead of after every one?

I believe this is not possible.

I'm not sure how to fire many lambdas in parallel in one step and have them all return before the state machine transitions

Did you see this in the docs? http://docs.aws.amazon.com/step-functions/latest/dg/amazon-states-language-parallel-state.html

jackko answered Oct 18 '22 13:10