
How to load streaming data from Amazon SQS?

I use Spark 2.2.0.

How can I feed an Amazon SQS stream into Spark Structured Streaming using pyspark?

This question attempts to answer it for non-structured (DStream-based) streaming and for Scala, by creating a custom receiver.
Is something similar possible in pyspark?

spark.readStream \
   .format("s3-sqs") \
   .option("fileFormat", "json") \
   .option("queueUrl", ...) \
   .schema(...) \
   .load()

According to Databricks, the source above can be used for the S3-SQS file source. But how should one approach it for SQS alone?

I tried to understand AWS-SQS-Receive_Message for receiving messages. However, it was not clear how to feed that stream directly into Spark Streaming.

OSK asked Dec 28 '17



1 Answer

I know nothing about Amazon SQS, but "feeding" a stream into Spark Structured Streaming (aka Spark "Streams") is not how it works, with Amazon SQS or any other external messaging system or data source.

It's the other way round in Spark Structured Streaming: it is Spark that pulls the data in at regular intervals (similarly to the way the Kafka Consumer API works, where it pulls data rather than being given it).

In other words, Spark "Streams" is just another consumer of messages from a "queue" in Amazon SQS.

Whenever I'm asked to integrate an external system with Spark "Streams", I start by writing a client for that system using its client/consumer API.
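
A minimal sketch of such a client, assuming the AWS SDK for Java v1 (aws-java-sdk-sqs) is on the classpath; the queue URL and object name are made up for illustration:

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import com.amazonaws.services.sqs.model.ReceiveMessageRequest
import scala.collection.JavaConverters._

object SqsClientDemo extends App {
  // Hypothetical queue URL -- replace with your own.
  val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/my-queue"

  val sqs = AmazonSQSClientBuilder.defaultClient()

  // Long-poll for up to 10 messages at a time.
  val request = new ReceiveMessageRequest(queueUrl)
    .withMaxNumberOfMessages(10)
    .withWaitTimeSeconds(20)

  sqs.receiveMessage(request).getMessages.asScala.foreach { m =>
    println(m.getBody)
    // Delete a message once it has been handled so SQS does not redeliver it.
    sqs.deleteMessage(queueUrl, m.getReceiptHandle)
  }
}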

Once I have it, the next step is to develop a custom streaming Source for the external system, e.g. Amazon SQS, using the sample client code above.

Developing a custom streaming Source involves the following steps:

  1. Write a Scala class that implements the Source trait

  2. Register the Scala class (the custom Source) with Spark SQL using a META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file that contains the fully-qualified class name, or use the fully-qualified class name directly in the format method (both steps are sketched below).
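
Below is a skeletal sketch of both pieces, assuming Spark 2.2's internal Source API (org.apache.spark.sql.execution.streaming.Source). The package, class names, and the in-memory buffer standing in for real SQS polling are made up to keep the sketch short; it is not production-ready.

package com.example.sqs

import org.apache.spark.sql.{DataFrame, Encoders, SQLContext}
import org.apache.spark.sql.execution.streaming.{LongOffset, Offset, Source}
import org.apache.spark.sql.sources.{DataSourceRegister, StreamSourceProvider}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The streaming source itself: Spark calls getOffset/getBatch on every trigger.
class SqsSource(sqlContext: SQLContext, parameters: Map[String, String]) extends Source {

  // Messages polled from SQS (e.g. with a client like the one above) would be
  // appended here by a background thread; a plain Seq keeps the sketch short.
  @volatile private var buffer: Seq[String] = Seq.empty

  override def schema: StructType = SqsSource.schema

  // Highest offset available so far; None means no data has arrived yet.
  override def getOffset: Option[Offset] =
    if (buffer.isEmpty) None else Some(LongOffset(buffer.size - 1))

  // Rows between the two offsets as a DataFrame. A real source would normally
  // build this from an RDD of InternalRow via Spark's internal APIs.
  override def getBatch(start: Option[Offset], end: Offset): DataFrame = {
    val from = start.map(_.json.toLong + 1).getOrElse(0L).toInt
    val to = end.json.toLong.toInt
    sqlContext.createDataset(buffer.slice(from, to + 1))(Encoders.STRING).toDF("value")
  }

  override def commit(end: Offset): Unit = ()  // e.g. delete acknowledged SQS messages

  override def stop(): Unit = ()               // e.g. shut down the SQS client
}

object SqsSource {
  // A single string column holding the raw SQS message body.
  val schema: StructType = StructType(StructField("value", StringType) :: Nil)
}

// The provider Spark SQL looks up; shortName lets callers write format("sqs").
class SqsSourceProvider extends StreamSourceProvider with DataSourceRegister {

  override def shortName(): String = "sqs"

  override def sourceSchema(
      sqlContext: SQLContext,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): (String, StructType) =
    (shortName(), SqsSource.schema)

  override def createSource(
      sqlContext: SQLContext,
      metadataPath: String,
      schema: Option[StructType],
      providerName: String,
      parameters: Map[String, String]): Source =
    new SqsSource(sqlContext, parameters)
}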

Having a custom streaming source is therefore a two-part development: developing the source itself (and optionally registering it with Spark SQL), and using it in a Spark Structured Streaming application (in Python) by means of the format method.
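
The registration file contains nothing but the provider's fully-qualified class name (here the hypothetical one from the sketch above), e.g. under src/main/resources in an sbt or Maven build:

META-INF/services/org.apache.spark.sql.sources.DataSourceRegister:
com.example.sqs.SqsSourceProvider

With that on the Spark application's classpath, the pyspark code from the question can use format("sqs"), or format("com.example.sqs.SqsSourceProvider") if the registration file is skipped.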

answered by Jacek Laskowski