
Spark Streaming - processing binary data file

I'm using pyspark 1.6.0.

I have existing pyspark code that reads binary data files from an AWS S3 bucket. Other Spark/Python code parses the bits in the data to convert them into ints, strings, booleans, and so on. Each binary file holds one record of data.

In PySpark I read the binary files using: sc.binaryFiles("s3n://.......")

This works great: it gives a tuple of (filename, data). Now I'm trying to find an equivalent PySpark Streaming API to read binary files as a stream (and hopefully get the filename too, if possible).
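
For context, the batch version looks roughly like this (the bucket path and the struct record layout are placeholders for my actual data):

import struct

# Batch read: an RDD of (filename, file-contents-as-bytes) pairs.
raw = sc.binaryFiles("s3n://my-bucket/records/")

def parse_record(pair):
    filename, payload = pair
    # Placeholder layout: big-endian int, 8-byte string, bool (13 bytes).
    an_int, a_str, a_bool = struct.unpack(">i8s?", payload[:13])
    return (filename, an_int, a_str, a_bool)

records = raw.map(parse_record)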

I tried: binaryRecordsStream(directory, recordLength)

but I couldn't get this working...
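
For reference, this is roughly how I invoked it. The 1024-byte record length is just a placeholder, since my records are not a fixed byte length, which (as I understand it) is what this API assumes:

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)  # 10-second batches

# binaryRecordsStream splits each new file into fixed-length records,
# so a hard-coded length doesn't fit variable-length data.
stream = ssc.binaryRecordsStream("s3n://my-bucket/records/", 1024)
stream.pprint()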

Can anyone shed some light on how to read binary data files with PySpark Streaming?

asked Jun 29 '16 by yhw82

People also ask

How does Spark process streaming data?

Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches. Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data.
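
As a hedged illustration of that batching model (the socket source, host, port, and 5-second interval are all arbitrary):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="dstream-demo")
ssc = StreamingContext(sc, 5)  # cut the stream into 5-second batches

# Each batch of the DStream is processed as an ordinary RDD.
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()  # per-batch line count

ssc.start()
ssc.awaitTermination()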

What is binary stream format?

A binary format is a format in which file information is stored in the form of ones and zeros, or in some other binary (two-state) sequence. This type of format is often used for executable files and numeric information in computer programming and memory.
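
For instance, Python's struct module shows this two-state representation directly (the value 42 is arbitrary):

import struct

packed = struct.pack(">i", 42)  # 42 as a big-endian 4-byte integer
bits = "".join("{:08b}".format(b) for b in bytearray(packed))
print(bits)  # 00000000000000000000000000101010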

How does new data arriving in a stream get represented in Structured Streaming?

Every data item that is arriving on the stream is like a new row being appended to the Input Table. A query on the input will generate the “Result Table”. Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table.
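
A minimal sketch of that model in Structured Streaming (PySpark, Spark 2.x and later; the input path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("input-table-demo").getOrCreate()

# Lines from each newly arriving file become new rows of the Input Table.
input_table = spark.readStream.text("s3a://my-bucket/incoming/")

# This query maintains the Result Table.
result_table = input_table.groupBy("value").count()

query = (result_table.writeStream
         .outputMode("complete")  # re-emit the full Result Table each trigger
         .format("console")
         .start())
query.awaitTermination()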


2 Answers

In Spark Streaming, the relevant concept is the fileStream API, which is available in Scala and Java but not in Python, as noted in the documentation: http://spark.apache.org/docs/latest/streaming-programming-guide.html#basic-sources. If the files you are reading can be read as text files, you can use the textFileStream API.
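
A minimal sketch of textFileStream, assuming an existing SparkContext sc and a placeholder directory (note it yields lines of text only, with no filenames):

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)  # 10-second batches

# Monitors the directory and reads newly arriving files as text lines.
lines = ssc.textFileStream("s3n://my-bucket/incoming/")
lines.pprint()

ssc.start()
ssc.awaitTermination()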

answered Nov 08 '22 by JuJoDi


I had a similar question for Java Spark where I wanted to stream updates from S3, and there was no trivial solution, since the binaryRecordsStream(<path>,<record length>) API was only for fixed-length byte records, and I couldn't find an obvious equivalent to JavaSparkContext.binaryFiles(<path>). The solution, after reading what binaryFiles() does under the covers, was to do this:

// sc is a JavaStreamingContext; PortableDataStream and StreamInputFormat
// come from org.apache.spark.input.
JavaPairInputDStream<String, PortableDataStream> rawAuctions =
        sc.fileStream("s3n://<bucket>/<folder>",
                String.class, PortableDataStream.class, StreamInputFormat.class);

Then parse the individual byte messages from the PortableDataStream objects. I apologize for the Java context, but perhaps there is something similar you can do with PySpark.
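
In PySpark terms, the per-batch parsing would live in foreachRDD. The sketch below is only illustrative: pairs_stream, a DStream of (filename, bytes) pairs, is hypothetical, since PySpark does not expose fileStream to construct one the way the Java code above does.

import struct

def process_batch(time, rdd):
    # Runs on the driver once per micro-batch.
    for filename, payload in rdd.collect():
        # Placeholder layout: one big-endian int per record.
        (value,) = struct.unpack(">i", payload[:4])
        print(time, filename, value)

# pairs_stream would be the hypothetical DStream of (filename, bytes) pairs:
# pairs_stream.foreachRDD(process_batch)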

answered Nov 08 '22 by Marcus