
Is it against the Apache Beam Programming Model to Invoke an API?

Tags:

apache-beam

When using Apache Beam to enrich data, would it be wrong to make an API call for each data item?

(I'm new to Apache Beam)

Ravindranath Akila asked Jul 26 '17 05:07



2 Answers

No, but you can batch API calls for better performance. Check out the "Batched RPC" example in this blog post.
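A rough sketch of the batching idea in plain Java, not the code from the blog post: elements are buffered and sent to the external service in groups. The class `BatchingEnricher` and the `batchCall` function are hypothetical names for illustration. In a Beam `DoFn`, `add` would typically be driven from the `@ProcessElement` method and `flush` from the `@FinishBundle` method, so a partially filled buffer is still sent when the bundle ends.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Buffers elements and calls an external service once per batch instead of once per element. */
class BatchingEnricher<I, O> {
    private final int maxBatchSize;
    // Hypothetical batch endpoint: takes a list of inputs, returns one output per input.
    private final Function<List<I>, List<O>> batchCall;
    private final List<I> buffer = new ArrayList<>();
    private final List<O> results = new ArrayList<>();
    private int callCount = 0;

    BatchingEnricher(int maxBatchSize, Function<List<I>, List<O>> batchCall) {
        this.maxBatchSize = maxBatchSize;
        this.batchCall = batchCall;
    }

    /** Buffer one element; send the batch as soon as it is full. */
    void add(I element) {
        buffer.add(element);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    /** Send whatever is buffered, even if the batch is not full. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        results.addAll(batchCall.apply(new ArrayList<>(buffer)));
        callCount++;
        buffer.clear();
    }

    List<O> results() {
        return results;
    }

    int callCount() {
        return callCount;
    }
}
```

With a batch size of 3, seven elements followed by a final `flush` result in three calls to the service rather than seven.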

Another thing to note is that Beam cannot guarantee exactly-once delivery for external API calls, because a bundle of elements may be retried after a failure. If the pipeline in question needs exactly-once semantics, you should strive to make the API calls idempotent.
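One common way to get idempotency, sketched here under assumptions: derive a deterministic key from each element and have the service ignore repeats of that key. `IdempotentService` is a hypothetical in-memory stand-in for a real service that supports idempotency keys, and `IdempotencyKeys.keyFor` is an illustrative helper, not part of Beam.

```java
import java.util.HashMap;
import java.util.Map;

/** Stand-in for an external service that deduplicates requests by idempotency key. */
class IdempotentService {
    private final Map<String, String> applied = new HashMap<>();
    private int sideEffects = 0;

    /** Applies the request once per key; a retry returns the stored result. */
    String call(String idempotencyKey, String payload) {
        String previous = applied.get(idempotencyKey);
        if (previous != null) {
            return previous; // retried call: no second side effect
        }
        sideEffects++;
        String result = "processed:" + payload;
        applied.put(idempotencyKey, result);
        return result;
    }

    int sideEffects() {
        return sideEffects;
    }
}

class IdempotencyKeys {
    /**
     * Builds the key from a stable element ID. A random UUID or a timestamp
     * taken at call time would differ on retry and defeat deduplication.
     */
    static String keyFor(String elementId) {
        return "enrich-" + elementId;
    }
}
```

If the runner retries a bundle, the second call for the same element carries the same key, so the service applies the change only once.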

Jiayuan Ma answered Nov 02 '22 04:11


This depends on the type of API call and the size of your data. If each data item determines the API call that needs to be made, a call per element is appropriate. However, if you have some limited set of data that is used to enrich your input elements, it may be possible to use parts of the Beam programming model to reduce the number of external calls required.

As an example, if the data that produces the results of your API call can be preloaded, you may be able to use a side input: read all of the data and apply View.asMap (or whatever view is appropriate), reducing the number of API calls to some relatively constant number per execution. The side input can then be consumed by using ParDo.withSideInputs; see the programming guide, specifically the sections on ParDo and side inputs.
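The effect of preloading can be illustrated in plain Java, without the Beam API itself: the lookup data is fetched once and each element is then enriched from an in-memory map. In a pipeline, that map would be what the `DoFn` reads from a side input built with `View.asMap`. The class `PreloadedEnricher` and the `loadAll` supplier are hypothetical names for this sketch.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Enriches elements from a map that is loaded once, instead of calling an API per element. */
class PreloadedEnricher {
    private final Map<String, String> lookup;

    /** loadAll is a hypothetical bulk call to the external source; it runs exactly once. */
    PreloadedEnricher(Supplier<Map<String, String>> loadAll) {
        this.lookup = new HashMap<>(loadAll.get());
    }

    /** Looks up each key locally; no external call is made here. */
    List<String> enrich(List<String> keys) {
        List<String> out = new ArrayList<>();
        for (String key : keys) {
            out.add(key + "=" + lookup.getOrDefault(key, "unknown"));
        }
        return out;
    }
}
```

However many elements are enriched, the external source is contacted once per load. This only fits when the lookup data is small enough to hold in memory and does not need to be fresh per element.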

Thomas Groh answered Nov 02 '22 04:11