I am trying to process some XML data received on a JMS queue (Qpid) using Spark Streaming. After getting the XML as a DStream, I convert it to DataFrames so I can join it with some static data already loaded as DataFrames. But per the API documentation for the foreachRDD method on DStream, the function is executed on the driver, so does that mean all the processing logic will run only on the driver and not get distributed to the workers/executors?
API Documentation
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
> so does that mean all processing logic will only run on Driver and not get distributed to workers/executors?
No. The function itself runs on the driver, but don't forget that it operates on an RDD. The inner operations you apply to that RDD, such as `foreachPartition`, `map`, `filter`, etc., still run on the worker nodes. This won't cause all the data to be sent back over the network to the driver, unless you call a method like `collect`, which does.
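As a minimal sketch of that split, here is roughly what a batch handler could look like. All names here (`xmlStream`, `staticDf`, `extractId`, the join key `id`, and the paths) are hypothetical placeholders, not taken from the original question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.dstream.DStream

object ForeachRddSketch {
  // Hypothetical helper: pulls a join key out of one XML record.
  def extractId(xml: String): String = xml.take(8)

  def process(xmlStream: DStream[String], spark: SparkSession): Unit = {
    import spark.implicits._

    // Static lookup data, loaded once up front (assumed path).
    val staticDf = spark.read.parquet("/path/to/static")

    xmlStream.foreachRDD { rdd =>
      // This closure runs on the DRIVER, once per batch interval.
      if (!rdd.isEmpty()) {
        // The map below is an RDD transformation: Spark serializes it
        // and executes it on the EXECUTORS, not the driver.
        val batchDf = rdd
          .map(xml => (extractId(xml), xml))
          .toDF("id", "payload")

        // The join and the write are likewise planned here on the driver,
        // but the actual work runs on the executors when the action
        // (the write) triggers computation. No data comes back to the
        // driver unless you explicitly collect it.
        batchDf
          .join(staticDf, "id")
          .write.mode("append").parquet("/path/to/output") // assumed sink
      }
    }
  }
}
```

The point of the sketch: only the thin orchestration code (the `if`, the plan construction) executes in the driver process; every per-record operation stays distributed.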