I'd like to call a web service to get some data in Spark Structured Streaming. Is it possible? How?
TL;DR Technically, yes.
However, Spark Structured Streaming is not necessary just to consume a web service. Structured Streaming is designed for long-running applications that subscribe to a constantly emitting data source. Technically, a web service can emit events via long polling or server-sent events (I assume this is not the case here, otherwise you would have mentioned it). To consume a REST service in a streaming job you would need to implement a custom data source.
A normal Spark job makes more sense, provided the data workload justifies the complexity of distributed processing. Even that is not a very common case: Spark is used in a big data context, and fetching data over HTTP is very slow for large-scale data processing.
Instead of Spark consuming a REST service over HTTP, the service would typically publish its data to a distributed queue, which is then consumed by a Spark streaming job or a normal Spark batch job. Another strategy is to store the data in a database and consume it directly via the JDBC data source (see the sketch below). Best practice is to replicate the data to a data lake/data warehouse such as Hive, or directly into a distributed filesystem such as HDFS or Amazon S3.
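For the JDBC route, a minimal PySpark sketch could look like the following; the connection URL, table name and credentials are placeholders, not values from the question:

df = (spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")  # placeholder connection URL
    .option("dbtable", "public.events")                    # placeholder table
    .option("user", "spark")                               # placeholder credentials
    .option("password", "secret")
    .load())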
Still, consuming a REST service in Spark can technically be done. The default Spark API does not provide a REST data source, but there are 3rd-party implementations.
It can also be achieved in a normal Spark job without a custom data source, provided the web service response fits in memory on the driver node:
Python:
import requests

data = requests.get('https://my.service.tm/api.json').json()
# => [{'id': 1, 'foo': 'bar'}, {'id': 2, 'foo': 'baz'}]
df = spark.createDataFrame(data)
# => [Row(id=1, foo='bar'), Row(id=2, foo='baz')]
If the response does not fit in memory and the API is paginated, one can create an RDD with n page ids, map each page id to its response, and optionally convert the RDD to a DataFrame.
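A minimal sketch of that approach, assuming the same hypothetical endpoint accepts a page query parameter and returns a JSON list per page (n_pages is also an assumption):

import requests

n_pages = 100  # assumed to be known up front or fetched from the API beforehand

def fetch_page(page):
    # executed on the executors, one HTTP call per page
    return requests.get('https://my.service.tm/api.json', params={'page': page}).json()

pages = spark.sparkContext.parallelize(range(1, n_pages + 1))
records = pages.flatMap(fetch_page)  # one element per JSON object
df = spark.createDataFrame(records)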
Can a web service be called from a Spark job?
Sure. Think of Spark as a distributed computation engine where a computation is "calling a web service". The computation would be executed on executors and there could be thousands of your computations calling the web service en masse.
You could consider a JDBC database just another kind of web service, couldn't you? And for JDBC, Spark comes with the JDBC data source. That could be the basis for a data source that calls web services.
Scheduling a Spark job for execution (of any kind) is as simple as using SparkContext.submitJob or SparkContext.runJob.
Just to give you a bit more background on what a Spark job calling a web service would look like, consider the following:
import org.apache.spark.TaskContext

val nums = sc.parallelize(1 to 5)

// runs once per partition, on the executors
def processPartition(ctx: TaskContext, ns: Iterator[Int]) = println("calling a web service")

sc.runJob(nums, processPartition _)
It's as simple as that :)
In Spark SQL it'd be no different, but you're at a higher level. Rather than using the RDD API directly, you simply rely on the high-level API of Spark SQL to generate the corresponding RDD-based code (so your computation is distributed).
Sample code could be as follows:
import org.apache.spark.sql.functions.udf

val nums = spark.range(5)

// the body of the UDF runs on the executors, once per row
def callWs(n: Long): Long = {
  println("calling a web service")
  n
}

val callWsUdf = udf(callWs _)
val q = nums.withColumn("ws", callWsUdf($"id"))
scala> q.show
calling a web service
calling a web service
calling a web service
calling a web service
calling a web service
+---+---+
| id| ws|
+---+---+
| 0| 0|
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
+---+---+
In the end, it's best to develop a custom data source for this to encapsulate the processing logic (if it's going to be used multiple times or by non-developers).