I use Spark 2.2.0.
How can I feed an Amazon SQS stream to a Spark structured stream using PySpark?
This question addresses it for the older, non-structured Spark Streaming and for Scala by creating a custom receiver.
Is something similar possible in PySpark?
spark.readStream \
  .format("s3-sqs") \
  .option("fileFormat", "json") \
  .option("queueUrl", ...) \
  .schema(...) \
  .load()
According to Databricks, the above source can be used for the S3-SQS file source. But how might one approach it for SQS alone?
I tried to work out how to receive messages from AWS-SQS-Receive_Message, but it was not clear how to feed that stream directly into Spark Structured Streaming.
An alternative is to move the data through Amazon Kinesis instead. Amazon Kinesis Data Firehose is the easiest way to capture, transform, and load data streams into AWS data stores for near-real-time analytics with existing business intelligence tools.
There are three ways to deal with streaming data: batch-process it at intervals ranging from hours to days, process the stream in real time, or do both in a hybrid process. Batch processing has the advantage of allowing deep analysis, including machine learning, and the disadvantage of high latency.
To put data into a Kinesis stream, you specify the name of the stream, a partition key, and the data blob to be added. The partition key determines which shard in the stream the data record is added to, and all the data in a shard is sent to the same worker that is processing the shard.
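For illustration, here is a minimal sketch of putting a record into a Kinesis stream from Scala with the AWS SDK for Java (v1); the stream name, partition key and payload are made-up placeholders, not anything from the question:

import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder
import com.amazonaws.services.kinesis.model.PutRecordRequest

object PutRecordSketch {
  def main(args: Array[String]): Unit = {
    // Client picks up credentials and region from the default provider chain.
    val kinesis = AmazonKinesisClientBuilder.defaultClient()

    val request = new PutRecordRequest()
      .withStreamName("my-stream")      // name of the stream (placeholder)
      .withPartitionKey("sensor-42")    // determines the target shard
      .withData(ByteBuffer.wrap("""{"value": 1}""".getBytes(StandardCharsets.UTF_8)))

    val result = kinesis.putRecord(request)
    println(s"shard=${result.getShardId} seq=${result.getSequenceNumber}")
  }
}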
I know nothing about Amazon SQS, but "feeding an Amazon SQS stream to a Spark structured stream using PySpark" is not possible with any external messaging system or data source in Spark Structured Streaming (a.k.a. Spark "Streams").
It's the other way round in Spark Structured Streaming: it is Spark that pulls data in at regular intervals (similar to the way the Kafka Consumer API works, where it pulls data in rather than being given it).
In other words, Spark "Streams" is just another consumer of messages from a "queue" in Amazon SQS.
Whenever I'm asked to integrate an external system with Spark "Streams", I start by writing a client for the system using its client/consumer API.
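For SQS that client could start out as something like the sketch below (Scala, AWS SDK for Java v1; the queue URL is a placeholder and error handling is omitted):

import scala.collection.JavaConverters._

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest

object SqsConsumerSketch {
  def main(args: Array[String]): Unit = {
    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"
    val sqs = AmazonSQSClientBuilder.defaultClient()

    val request = new ReceiveMessageRequest(queueUrl)
      .withMaxNumberOfMessages(10)
      .withWaitTimeSeconds(20)   // long polling

    val messages = sqs.receiveMessage(request).getMessages.asScala
    messages.foreach { m =>
      println(m.getBody)
      // Delete (acknowledge) the message once it has been handed downstream.
      sqs.deleteMessage(queueUrl, m.getReceiptHandle)
    }
  }
}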
Once I have it, the next step is to develop a custom streaming Source for the external system, e.g. Amazon SQS, using the sample client code above.
While developing a custom streaming Source you have to do the following steps:
1. Write a Scala class that implements the Source trait (sketched below)
2. Register the Scala class (the custom Source) with Spark SQL using a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file with the fully-qualified class name, or use the fully-qualified class name in format
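A skeleton of such a Source and its provider could look like the sketch below. Everything here is illustrative: the package, class names, schema and in-memory buffering are made up, and a real implementation would poll SQS (e.g. with the client sketched earlier), track offsets durably and return only the requested range in getBatch.

package com.example.sqs   // hypothetical package

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Provider registered via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister
// or referenced by its fully-qualified class name in format(...).
class SqsSourceProvider extends StreamSourceProvider with DataSourceRegister {

  override def shortName(): String = "sqs"

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), schema.getOrElse(SqsSource.DefaultSchema))

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new SqsSource(sqlContext, parameters("queueUrl"))
}

object SqsSource {
  val DefaultSchema: StructType = StructType(StructField("body", StringType) :: Nil)
}

class SqsSource(sqlContext: SQLContext, queueUrl: String) extends Source {

  import sqlContext.implicits._

  // Messages polled from SQS, buffered until Spark asks for the next batch.
  @volatile private var buffer = Vector.empty[String]
  @volatile private var currentOffset = LongOffset(-1L)

  override def schema: StructType = SqsSource.DefaultSchema

  override def getOffset: Option[Offset] = {
    // Poll SQS here (e.g. with the client sketched earlier), append the
    // message bodies to `buffer` and advance `currentOffset`.
    if (currentOffset.offset < 0) None else Some(currentOffset)
  }

  override def getBatch(start: Option[Offset], end: Offset): DataFrame =
    // A real source would return only the messages between `start` and `end`.
    buffer.toDF("body")

  override def stop(): Unit = ()   // shut down the SQS client here
}

From PySpark the source would then be referenced with spark.readStream.format("com.example.sqs.SqsSourceProvider") (or format("sqs") once registered) followed by .load(), just like the s3-sqs snippet in the question.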
Having a custom streaming source is a two-part development: developing the source (and optionally registering it with Spark SQL), and then using it in a Spark Structured Streaming application (in Python) by means of the format method.