I'm using Python Beam on Google Dataflow, and my pipeline looks like this:
Read image urls from file >> Download images >> Process images
The problem is that I can't let the Download images step scale as much as it needs, because my application can get blocked by the image server.
Is there a way to throttle that step? Either on input or output per minute.
Thank you.
One possibility, maybe naïve, is to introduce a sleep in the step. For this you need to know the maximum number of instances of the ParDo that can be running at the same time. If autoscalingAlgorithm is set to NONE you can obtain that from numWorkers and workerMachineType (DataflowPipelineOptions). Precisely, the desired rate is split across all threads, so the effective per-thread rate is desired_rate / (num_workers * num_threads), where num_threads is the thread count per worker. The sleep time is the inverse of that effective rate. For example, with desired_rate = 1 QPS and two n1-standard-4 workers (8 threads in total), each thread should sleep 8 seconds per element:
Integer desired_rate = 1; // Overall QPS limit for the step

int num_workers;
if (options.getNumWorkers() == 0) { num_workers = 1; }
else { num_workers = options.getNumWorkers(); }

int num_threads;
if (options.getWorkerMachineType() != null) {
    String machine_type = options.getWorkerMachineType();
    // e.g. "n1-standard-4" -> 4 threads per worker
    num_threads = Integer.parseInt(machine_type.substring(machine_type.lastIndexOf("-") + 1));
}
else { num_threads = 1; }

// Seconds each thread sleeps per element so that all threads together stay at desired_rate
Double sleep_time = (double)(num_workers * num_threads) / (double)(desired_rate);
Then you can use TimeUnit.SECONDS.sleep(sleep_time.intValue()); or equivalent inside the throttled Fn. As a use case, in my example I wanted to read from a public file, filter out the empty lines and call the Natural Language API with a maximum rate of 1 QPS (desired_rate was initialized to 1 above):
p
.apply("Read Lines", TextIO.read().from("gs://apache-beam-samples/shakespeare/kinglear.txt"))
.apply("Omit Empty Lines", ParDo.of(new OmitEmptyLines()))
.apply("NLP requests", ParDo.of(new ThrottledFn()))
.apply("Write Lines", TextIO.write().to(options.getOutput()));
The rate-limited Fn is ThrottledFn; note the sleep call inside processElement:
static class ThrottledFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // Instantiates a client
        try (LanguageServiceClient language = LanguageServiceClient.create()) {
            // The text to analyze
            String text = c.element();
            Document doc = Document.newBuilder()
                .setContent(text).setType(Type.PLAIN_TEXT).build();

            // Detects the sentiment of the text
            Sentiment sentiment = language.analyzeSentiment(doc).getDocumentSentiment();
            String nlp_results = String.format("Sentiment: score %s, magnitude %s", sentiment.getScore(), sentiment.getMagnitude());

            // Throttle: each thread sleeps sleep_time seconds per element, so all threads
            // together stay under desired_rate (sleep_time must be visible here; see the note below)
            TimeUnit.SECONDS.sleep(sleep_time.intValue());

            Log.info(nlp_results);
            c.output(nlp_results);
        }
    }
}
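As written, sleep_time is computed in the driver program, so it has to be made visible inside the DoFn somehow (the snippet above doesn't show that part). A minimal sketch of one way to do it, assuming you're fine fixing the value at construction time, is to pass it through the DoFn's constructor so it gets serialized along with the ParDo:

static class ThrottledFn extends DoFn<String, String> {
    // Seconds to sleep per element on each thread (serialized with the DoFn)
    private final int sleepSeconds;

    ThrottledFn(int sleepSeconds) {
        this.sleepSeconds = sleepSeconds;
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        // ... call the external service here ...
        TimeUnit.SECONDS.sleep(sleepSeconds);
        c.output(c.element());
    }
}

and then apply it with ParDo.of(new ThrottledFn(sleep_time.intValue())).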
With this throttled Fn I get a rate of about 1 element/s and avoid hitting quota when using multiple workers, even if requests are not really spread out (you might get 8 simultaneous requests followed by an 8 s sleep, etc.). This was just a test; a better implementation would probably use Guava's RateLimiter (see the sketch below).
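For reference, here is a rough sketch of that idea (my own, not from the original answer): create one RateLimiter per DoFn instance in @Setup, where the per-instance rate is the same desired_rate / (num_workers * num_threads) used above.

import com.google.common.util.concurrent.RateLimiter;

static class RateLimitedFn extends DoFn<String, String> {
    private final double perThreadRate; // QPS budget for each DoFn instance (thread)
    private transient RateLimiter limiter;

    RateLimitedFn(double perThreadRate) {
        this.perThreadRate = perThreadRate;
    }

    @Setup
    public void setup() {
        // One limiter per DoFn instance; Dataflow runs roughly one instance per worker thread
        limiter = RateLimiter.create(perThreadRate);
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        limiter.acquire(); // blocks until this thread may proceed
        // call the external API here, then emit the result
        c.output(c.element());
    }
}

The limiter field is transient and re-created in @Setup because RateLimiter isn't serializable, while the DoFn has to be. Applied as ParDo.of(new RateLimitedFn(desired_rate / (double)(num_workers * num_threads))), it smooths requests out instead of sleeping in bursts.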
If the pipeline uses autoscaling (THROUGHPUT_BASED) then it gets more complicated, and the number of workers would need to be updated at runtime (for example, Stackdriver Monitoring exposes a job/current_num_vcpus metric). Other general considerations would be controlling the number of parallel ParDos by using a dummy GroupByKey (see the sketch below) or splitting the source with splitIntoBundles, etc. I'd like to see if there are other nicer solutions.