 

Batched API calls inside Apache Spark?

Tags:

apache-spark

I am a beginner with Apache Spark and have the following task:

I am reading records from a data source that, within the Spark transformations, need to be enriched by data from a call to an external webservice before they can be processed any further.

The webservice will accept parallel calls to a certain extent, but only allows a few hundred records to be sent at once. It is also quite slow, so batching up as much as possible and making requests in parallel definitely help here.

Is there a way to do this with Spark in a reasonable manner?

I thought of reading the records, pre-processing them into another data source (an "API queue"), then reading that queue 500 records at a time (if possible with multiple processes), writing the enriched records to the next data source, and using that result data source for the final transformations.

The only place where those weird limits need to be respected is within the API calls (that's why I thought some intermediate data format / data source would be appropriate).
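The batching constraint described above can be isolated in one small helper, independent of Spark: group an arbitrary stream of records into chunks of at most 500 before each webservice call. A minimal sketch (the 500-record limit comes from the question; everything else is generic):

```python
from itertools import islice

def batches(records, size=500):
    """Yield successive chunks of at most `size` records from any iterable."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk
```

Each chunk can then be sent as one request, so only this helper needs to know about the API's limit.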

Any ideas or directions you want to point me to?

Gregor Melhorn asked Feb 03 '16 05:02


2 Answers

You can do this with mapPartitions; see this question:

Whats the Efficient way to call http request and read inputstream in spark MapTask

The function passed to mapPartitions runs once per partition, so setup/teardown code (for example, creating an HTTP client) also runs once per partition. Do a coalesce before the mapPartitions call to reduce the number of partitions down to the level of concurrency the webservice can comfortably support.

You may also want to sort or repartition the RDD by key first, so a given key lands in a single partition, and code the mapPartitions function appropriately to avoid hitting the same key repeatedly.
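A minimal PySpark-style sketch of this approach: the per-partition function batches its records in groups of 500 and calls the webservice once per batch. Here `call_api` is a hypothetical stand-in for the real webservice client, and the batch size and partition count are assumptions to adapt:

```python
from itertools import islice

BATCH_SIZE = 500  # the webservice's per-request limit (from the question)

def call_api(batch):
    # Hypothetical stand-in for the real webservice call;
    # returns one enriched record per input record.
    return [{**record, "enriched": True} for record in batch]

def enrich_partition(records):
    # Runs once per partition: set up any client state here, then
    # send the partition's records to the API in batches of 500.
    it = iter(records)
    while True:
        batch = list(islice(it, BATCH_SIZE))
        if not batch:
            break
        for enriched in call_api(batch):
            yield enriched

# In Spark (coalesce to the API's comfortable concurrency, e.g. 8):
# enriched_rdd = rdd.coalesce(8).mapPartitions(enrich_partition)
```

Because `enrich_partition` is an ordinary generator over an iterator, it can be tested locally on a plain list before wiring it into the Spark job.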

jonathanChap answered Sep 20 '22 11:09


If you call your external API inside your RDD processing, the calls will be made in parallel by the Spark executors, which is exactly what you want for fast processing of your data.

If you want to compensate on your side for the sluggishness of the API, you can put a caching server in front of it to absorb repeated requests, such as memcached: http://memcached.org/

Francois G answered Sep 18 '22 11:09