
K8S Ingress: How to limit requests in flight per pod

I am porting an application to run within k8s. I have run into an issue with ingress. I am trying to find a way to limit the number of REST API requests in flight at any given time to each backend pod managed by a deployment.

See the image below that shows the architecture.

Ingress is being managed by nginx-ingress. For a given set of URL paths, the ingress forwards the request to a service that targets a deployment of REST API backend processes. The deployment is also managed by an HPA based upon CPU load.

What I want to do is find a way to queue up ingress requests such that there are never more than X requests in flight to any pod running our API backend process (e.g. allow only 50 requests in flight at once per pod).

Does anyone know how to put a request limit in place like this?

As a bonus question, the next thing I would need to do is have the HPA monitor request queuing and automatically scale the deployment up/down so that the number of pods matches the number of requests currently being processed/queued. For example, if each pod can handle 100 requests in flight at once and we currently have a load of 1000 requests to handle, then autoscale to 10 pods.

If it is useful, I am also planning to have linkerd in place for this cluster. Perhaps it has a capability that could help.

[Architecture diagram: nginx-ingress → Service → REST API backend pods (Deployment managed by an HPA)]

asked Oct 15 '22 by Allen

2 Answers

Autoscaling on network requests requires custom metrics. Given that you are using the NGINX ingress controller, you can first install Prometheus and the Prometheus adapter to export the metrics from the NGINX ingress controller. By default, the NGINX ingress controller already exposes a Prometheus endpoint.

The chain of components looks like this:

NGINX ingress <- Prometheus <- Prometheus Adaptor <- custom metrics api service <- HPA controller

The arrows indicate the direction of the API calls. So, in total, you will have three extra components in your cluster.
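
To make this concrete, a prometheus-adapter rule that turns the ingress controller's request counter into a per-second custom metric might look roughly like the sketch below. The metric name nginx_ingress_controller_requests and the label names are assumptions based on recent ingress-nginx versions, so adjust them to whatever your controller actually exports.

# prometheus-adapter config sketch (metric and label names are assumptions)
rules:
- seriesQuery: 'nginx_ingress_controller_requests{namespace!="",service!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      service: {resource: "service"}
  name:
    matches: "^(.*)_requests$"
    as: "${1}_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

The resulting *_requests_per_second metric is then served through the custom metrics API, which is what an HPA like the one below consumes (the exact metric name depends on your rule and series names).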

Once you have set up the custom metrics server, you can scale your app based on the metrics from the NGINX ingress controller. The HPA will look like this:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
  name: srv-deployment-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: srv-deployment
  minReplicas: 1
  maxReplicas: 100
  metrics:
  - type: Pods
    pods:
      # Custom metric exposed through the Prometheus adapter
      metricName: nginx_srv_server_requests_per_second
      # Target ~100 requests per second per pod on average
      targetAverageValue: 100

I won't go through the actual implementation here because it will include a lot of environment specific configuration.

Once you have set that up, you can see that the HPA object shows the metrics it is pulling from the adapter.

For rate limiting at the Service level, you will need a more feature-rich service mesh. Linkerd2 is designed to be lightweight, so it does not ship with rate limiting. You can refer to this issue in the linkerd2 repository: the maintainers declined to implement rate limiting at the service level and suggest doing it at the ingress level instead.

AFAIK, Istio and some other advanced service meshes provide rate limiting. If you haven't committed to Linkerd as your service mesh option yet, you may want to try Istio instead.

For Istio, you can refer to this document to see how to do rate limiting. But be aware that Istio together with the NGINX ingress controller may cause you trouble: Istio ships with its own ingress controller, so you will need extra work to make the two play together.

To conclude, if you can use the HPA with custom metrics based on the number of requests, that is the quickest way to get your traffic under control. Only if you still have a really hard time with traffic control should you consider Service-level rate limiting.

answered Oct 21 '22 by Ryan Siu


The NGINX ingress controller allows rate limiting via annotations. You may want to have a look at the limit-rps one:

  • nginx.ingress.kubernetes.io/limit-rps: the number of requests accepted from a given IP each second. The burst limit is set to this limit multiplied by the burst multiplier (default multiplier: 5). When clients exceed this limit, the status code configured in limit-req-status-code (default: 503) is returned.

On top of that, NGINX queues your requests using the leaky bucket algorithm: incoming requests are buffered in a FIFO (first-in-first-out) queue and then consumed at the limited rate. The burst value defines the size of this queue, i.e. how far requests may exceed the configured limit. When the queue becomes full, subsequent requests are rejected.
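
As an illustration (not part of the original answer), an Ingress using these annotations might look roughly like this; the host, service name, and limit values are hypothetical:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    # Accept at most 50 requests per second from a single client IP (hypothetical value)
    nginx.ingress.kubernetes.io/limit-rps: "50"
    # Queue up to limit-rps * multiplier requests before rejecting (default multiplier: 5)
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"
spec:
  ingressClassName: nginx
  rules:
  - host: api.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: srv-backend        # hypothetical Service in front of the API pods
            port:
              number: 80

Keep in mind that these limits are applied per client IP at the ingress, not per backend pod.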

For more detailed reading about rate limiting and traffic shaping:

  • Nginx rate limiting in a nutshell
  • Rate limiting nginx
answered Oct 21 '22 by acid_fuji