
Spark Streaming - processing binary data file

I'm using pyspark 1.6.0.

I have existing pyspark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data.

In PySpark I read the binary files using: sc.binaryFiles("s3n://.......")

This works great: it gives a tuple of (filename, data). Now I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (and hopefully get the filename too, if possible).
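
For context, the batch version looks roughly like this (the bucket path and the struct record layout are placeholders for my actual data):

import struct

# Batch read: an RDD of (filename, file-contents-as-bytes) pairs.
raw = sc.binaryFiles("s3n://my-bucket/records/")

def parse_record(pair):
    filename, payload = pair
    # Placeholder layout: big-endian int, 8-byte string, bool (13 bytes).
    an_int, a_str, a_bool = struct.unpack(">i8s?", payload[:13])
    return (filename, an_int, a_str, a_bool)

records = raw.map(parse_record)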

I tried: binaryRecordsStream(directory, recordLength)

but I couldn't get this working...
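
For reference, this is roughly how I invoked it. The 1024-byte record length is just a placeholder, since my records are not a fixed byte length, which (as I understand it) is what this API assumes:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)  # 10-second batches

# binaryRecordsStream splits each new file into fixed-length records,
# so a hard-coded length doesn't fit variable-length data.
stream = ssc.binaryRecordsStream("s3n://my-bucket/records/", 1024)
stream.pprint()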

Can anyone shed some light on how to read binary data files with PySpark Streaming?

asked Jun 29 '16 by yhw82

People also ask

How does Spark process streaming data?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
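
As a hedged illustration of that batching model (the socket source, host, port, and 5-second interval are all arbitrary):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, 5)  # cut the stream into 5-second batches

# Each batch of the DStream is processed as an ordinary RDD.
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()  # per-batch line count

ssc.start()
ssc.awaitTermination()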

What is binary stream format?

A binary format is a format in which file information is stored in the form of ones and zeros, or in some other binary (two-state) sequence. This type of format is often used for executable files and numeric information in computer programming and memory.
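
For instance, Python's struct module shows this two-state representation directly (the value 42 is arbitrary):

import struct

packed = struct.pack(">i", 42)  # 42 as a big-endian 4-byte integer
bits = "".join("{:08b}".format(b) for b in bytearray(packed))
print(bits)  # 00000000000000000000000000101010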

How does new data arriving in a stream get represented in Structured Streaming?

Every data item that is arriving on the stream is like a new row being appended to the Input Table. A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table.
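
A minimal sketch of that model in Structured Streaming (PySpark, Spark 2.x and later; the input path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("input-table-demo").getOrCreate()

# Lines from each newly arriving file become new rows of the Input Table.
input_table = spark.readStream.text("s3a://my-bucket/incoming/")

# This query maintains the Result Table.
result_table = input_table.groupBy("value").count()

query = (result_table.writeStream
         .outputMode("complete")  # re-emit the full Result Table each trigger
         .format("console")
         .start())
query.awaitTermination()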


2 Answers

In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the files you are reading can be read as text files, you can use the textFileStream API.
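
A minimal sketch of textFileStream, assuming an existing SparkContext sc and a placeholder directory (note it yields lines of text only, with no filenames):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)  # 10-second batches

# Monitors the directory and reads newly arriving files as text lines.
lines = ssc.textFileStream("s3n://my-bucket/incoming/")
lines.pprint()

ssc.start()
ssc.awaitTermination()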

answered Nov 08 '22 by JuJoDi


I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>,<record length>) API was only for fixed-length byte records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:

// sc is a JavaStreamingContext; PortableDataStream and StreamInputFormat
// come from org.apache.spark.input.
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
        sc.fileStream("s3n://<bucket>/<folder>",
                String.class, PortableDataStream.class, StreamInputFormat.class);

Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.
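
In PySpark terms, the per-batch parsing would live in foreachRDD. The sketch below is only illustrative: pairs_stream, a DStream of (filename, bytes) pairs, is hypothetical, since PySpark does not expose fileStream to construct one the way the Java code above does.

import struct

def process_batch(time, rdd):
    # Runs on the driver once per micro-batch.
    for filename, payload in rdd.collect():
        # Placeholder layout: one big-endian int per record.
        (value,) = struct.unpack(">i", payload[:4])
        print(time, filename, value)

# pairs_stream would be the hypothetical DStream of (filename, bytes) pairs:
# pairs_stream.foreachRDD(process_batch)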

answered Nov 08 '22 by Marcus