I am new to Apache Spark and have the following task:
I am reading records from a data source that, within the Spark transformations, need to be enriched with data from a call to an external web service before they can be processed any further.
The web service accepts parallel calls to a certain extent, but only allows a few hundred records to be sent at once. It is also quite slow, so batching up as much as possible and making parallel requests definitely help here.
Is there a way to do this with Spark in a reasonable manner?
I thought of reading the records, pre-processing them into another data source, then reading that "API queue" data source 500 records at a time (with multiple processes, if possible), writing the enriched records to the next data source, and using this result data source for the final transformations.
The only place where those odd limits need to be respected is within the API calls (which is why I thought some intermediate data format / data source would be appropriate).
Any ideas or directions you want to point me to?
You can do this with mapPartitions; see this question:
Whats the Efficient way to call http request and read inputstream in spark MapTask
mapPartitions runs once per partition, so setup/teardown code can run once per partition. Do a coalesce before the mapPartitions to reduce the number of partitions to the level of concurrency the web service can comfortably support.
You may want to sort the RDD first to avoid calling the web service more than once for a given key; code the mapPartitions function appropriately to avoid hitting the same key repeatedly.
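A minimal sketch of that approach (the `call_api` function, the partition count of 8, and the batch size of 500 are assumptions; the real request logic depends on your web service):

```python
from itertools import islice


def enrich_partition(records, call_api, batch_size=500):
    """Send the records of one partition to the web service in batches
    of at most `batch_size` and yield the enriched results one by one.

    `call_api` is an assumed helper that accepts a list of records and
    returns a list of enriched records (one HTTP request per batch).
    """
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        for enriched in call_api(batch):
            yield enriched


# Inside a Spark job this would be wired up roughly like this
# (untested sketch; coalesce(8) caps concurrency at 8 parallel callers):
#
# enriched_rdd = (raw_rdd
#     .coalesce(8)
#     .mapPartitions(lambda part: enrich_partition(part, call_api)))
```

Because `enrich_partition` is a generator, Spark never has to hold more than one batch of raw and enriched records in memory per partition at a time.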
If you call your external API inside your RDD processing, the call will be made in parallel by each Spark executor, which, if you think about it, is what you want for fast processing of your data.
If you want to compensate on your side for the sluggishness of the API, you can install a caching server to handle repeated requests, such as memcached: http://memcached.org/
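The same idea can be sketched in-process: wrap the (assumed) per-key API call in a simple local cache, so repeated keys within one executor hit the service only once. A shared server such as memcached would replace the local dict and let all executors share results:

```python
def make_cached_caller(call_api):
    """Wrap a per-key API call with a simple in-process cache.

    `call_api` is an assumed function taking one key and returning the
    enriched value. Repeated keys are served from the cache instead of
    hitting the slow web service again.
    """
    cache = {}

    def cached(key):
        if key not in cache:
            cache[key] = call_api(key)
        return cache[key]

    return cached
```

Note that with this in-process variant each executor keeps its own cache; only an external store like memcached deduplicates requests across the whole cluster.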