Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Limit on the rate of inferences one can make for a SageMaker endpoint

Is there a limit on the rate of inferences one can make for a SageMaker endpoint?

Is it determined somehow by the instance type behind the endpoint or the number of instances?

I tried looking for this info as AWS Service Quotas for SageMaker but couldn't find it.

I am invoking the endpoint from a Spark job abd wondered if the number of concurrent tasks is a factor I should be taking care of when running inference (assuming each task runs one inference at a time)

Here's the throttling error I got:

com.amazonaws.services.sagemakerruntime.model.AmazonSageMakerRuntimeException: null (Service: AmazonSageMakerRuntime; Status Code: 400; Error Code: ThrottlingException; Request ID: b515121b-f3d5-4057-a8a4-6716f0708980)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1712)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1367)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1113)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:770)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:744)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:726)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:686)
    at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:668)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:532)
    at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:512)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.doInvoke(AmazonSageMakerRuntimeClient.java:236)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invoke(AmazonSageMakerRuntimeClient.java:212)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.executeInvokeEndpoint(AmazonSageMakerRuntimeClient.java:176)
    at com.amazonaws.services.sagemakerruntime.AmazonSageMakerRuntimeClient.invokeEndpoint(AmazonSageMakerRuntimeClient.java:151)
    at lineefd06a2d143b4016906a6138a6ffec15194.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$a5cddfc4633c5dd8aa603ddc4f9aad5$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$Predictor.predict(command-2334973:41)
    at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
    at lineefd06a2d143b4016906a6138a6ffec15200.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$$$50a9225beeac265557e61f69d69d7d$$$$w$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(command-2307906:11)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:2000)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
    at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1220)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2321)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:140)
    at org.apache.spark.scheduler.Task.run(Task.scala:113)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:533)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1541)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:539)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
like image 212
Gaddy Avatar asked Oct 30 '25 16:10

Gaddy


1 Answers

Amazon SageMaker is offering model hosting service (https://docs.aws.amazon.com/sagemaker/latest/dg/how-it-works-hosting.html), which gives you a lot of flexibility based on your inference requirements.

As you noted, first you can choose the instance type to use for your model hosting. The large set of options is important to tune to your models. You can host the model on a GPU based machines (P2/P3/P4) or CPU ones. You can have instances with faster CPU (C4, for example), or more RAM (R4, for example). You can also choose instances with more cores (16xl, for example) or less (medium, for example). Here is a list of the full range of instances that you can choose: https://aws.amazon.com/sagemaker/pricing/instance-types/ . It is important to balance your performance and costs. The selection of the instance type and the type and size of your model will determine the invocations-per-second that you can expect from your model in this single-node configuration. It is important to measure this number to avoid hitting the throttle errors that you saw.

The second important feature of the SageMaker hosting that you use is the ability to auto-scale your model to multiple instances. You can configure the endpoint of your model hosting to automatically add and remove instances based on the load on the endpoint. AWS is adding a load balancer in front of the multiple instances that are hosting your models and distributing the requests among them. Using the autoscaling functionality allows you to keep a smaller instance for low traffic hours, and to be able to scale up during peak traffic hours, and still keep your costs low and your throttle errors to the minimum. See here for documentation on the SageMaker autoscaling options: https://docs.aws.amazon.com/sagemaker/latest/dg/endpoint-auto-scaling.html

like image 89
Guy Avatar answered Nov 02 '25 19:11

Guy



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!