I use Spark 2.2.0.
How can I feed an Amazon SQS stream to a Spark structured stream using PySpark?
This question addresses it for the older, non-structured Spark Streaming and for Scala by creating a custom receiver.
Is something similar possible in PySpark?
spark.readStream \
  .format("s3-sqs") \
  .option("fileFormat", "json") \
  .option("queueUrl", ...) \
  .schema(...) \
  .load()
According to Databricks, the above source can be used for the S3-SQS file source. But how might one approach it for SQS alone?
I tried to work out how to receive messages from AWS-SQS-Receive_Message, but it was not clear how to feed that stream directly into Spark Structured Streaming.
An alternative is to move the data through Amazon Kinesis instead. Amazon Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near-real-time analytics with existing business intelligence tools.
There are three ways to deal with streaming data: batch-process it at intervals ranging from hours to days, process the stream in real time, or do both in a hybrid process. Batch processing has the advantage of allowing deep analysis, including machine learning, and the disadvantage of high latency.
To put data into a Kinesis stream, you specify the name of the stream, a partition key, and the data blob to be added. The partition key determines which shard in the stream the data record is added to, and all the data in a shard is sent to the same worker that is processing the shard.
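For illustration, here is a minimal sketch of putting a record into a Kinesis stream from Scala with the AWS SDK for Java (v1); the stream name, partition key and payload are made-up placeholders, not anything from the question:

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object PutRecordSketch {
  def main(args: Array[String]): Unit = {
    // Client picks up credentials and region from the default provider chain.
    val kinesis = AmazonKinesisClientBuilder.defaultClient()

    val request = new PutRecordRequest()
      .withStreamName("my-stream")      // name of the stream (placeholder)
      .withPartitionKey("sensor-42")    // determines the target shard
      .withData(ByteBuffer.wrap("""{"value": 1}""".getBytes(StandardCharsets.UTF_8)))

    val result = kinesis.putRecord(request)
    println(s"shard=${result.getShardId} seq=${result.getSequenceNumber}")
  }
}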
I know nothing about Amazon SQS, but "feeding an Amazon SQS stream to a Spark structured stream using PySpark" is not possible with any external messaging system or data source in Spark Structured Streaming (a.k.a. Spark "Streams").
It's the other way round in Spark Structured Streaming: it is Spark that pulls data in at regular intervals (similar to the way the Kafka Consumer API works, where it pulls data in rather than being given it).
In other words, Spark "Streams" is just another consumer of messages from a "queue" in Amazon SQS.
Whenever I'm asked to integrate an external system with Spark "Streams", I start by writing a client for the system using its client/consumer API.
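For SQS that client could start out as something like the sketch below (Scala, AWS SDK for Java v1; the queue URL is a placeholder and error handling is omitted):

import scala.collection.JavaConverters._

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

object SqsConsumerSketch {
  def main(args: Array[String]): Unit = {
    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
    val sqs = AmazonSQSClientBuilder.defaultClient()

    val request = new ReceiveMessageRequest(queueUrl)
      .withMaxNumberOfMessages(10)
      .withWaitTimeSeconds(20)   // long polling

    val messages = sqs.receiveMessage(request).getMessages.asScala
    messages.foreach { m =>
      println(m.getBody)
      // Delete (acknowledge) the message once it has been handed downstream.
      sqs.deleteMessage(queueUrl, m.getReceiptHandle)
    }
  }
}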
Once I have it, the next step is to develop a custom streaming Source for the external system, e.g. Amazon SQS, using the sample client code above.
While developing a custom streaming Source you have to do the following steps:
1. Write a Scala class that implements the Source trait (sketched below)
2. Register the Scala class (the custom Source) with Spark SQL using a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the fully-qualified class name, or use the fully-qualified class name in format
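A skeleton of such a Source and its provider could look like the sketch below. Everything here is illustrative: the package, class names, schema and in-memory buffering are made up, and a real implementation would poll SQS (e.g. with the client sketched earlier), track offsets durably and return only the requested range in getBatch.

package com.example.sqs   // hypothetical package

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Provider registered via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// or referenced by its fully-qualified class name in format(...).
class SqsSourceProvider extends StreamSourceProvider with DataSourceRegister {

  override def shortName(): String = "sqs"

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), schema.getOrElse(SqsSource.DefaultSchema))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new SqsSource(sqlContext, parameters("queueUrl"))
}

object SqsSource {
  val DefaultSchema: StructType = StructType(StructField("body", StringType) :: Nil)
}

class SqsSource(sqlContext: SQLContext, queueUrl: String) extends Source {

  import sqlContext.implicits._

  // Messages polled from SQS, buffered until Spark asks for the next batch.
  @volatile private var buffer = Vector.empty[String]
  @volatile private var currentOffset = LongOffset(-1L)

  override def schema: StructType = SqsSource.DefaultSchema

  override def getOffset: Option[Offset] = {
    // Poll SQS here (e.g. with the client sketched earlier), append the
    // message bodies to `buffer` and advance `currentOffset`.
    if (currentOffset.offset < 0) None else Some(currentOffset)
  }

  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    // A real source would return only the messages between `start` and `end`.
    buffer.toDF("body")

  override def stop(): Unit = ()   // shut down the SQS client here
}

From PySpark the source would then be referenced with spark.readStream.format("com.example.sqs.SqsSourceProvider") (or format("sqs") once registered) followed by .load(), just like the s3-sqs snippet in the question.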
Having a custom streaming source is a two-part development: developing the source (and optionally registering it with Spark SQL), and then using it in a Spark Structured Streaming application (in Python) by means of the format method.