I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program.
This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark?
bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")
Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it.
The problem is not my use of 'pipe', but that 'textFile' cannot be used to read binary data: it treats the input as newline-delimited UTF-8 text, which corrupts arbitrary bytes. (Doh) There are a couple of options for moving forward.
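The corruption is easy to reproduce outside Spark. A pure-Python sketch (the byte values are arbitrary, chosen only to include undecodable bytes and a newline sequence): decoding binary as UTF-8 and splitting it into lines cannot be round-tripped.

```python
# Simulate reading binary data as newline-delimited UTF-8 text,
# which is what 'textFile' effectively does.
data = bytes([0x00, 0xff, 0x0d, 0x0a, 0x80, 0x41])

# Bytes that are not valid UTF-8 (0xff, 0x80) are replaced with U+FFFD...
text = data.decode("utf-8", errors="replace")

# ...and the newline bytes (\r\n) are consumed by line splitting.
lines = text.splitlines()

# Rejoining the "records" does not reproduce the original bytes.
recovered = "\n".join(lines).encode("utf-8")
assert recovered != data
```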
1. Implement a custom 'InputFormat' that understands the binary input data. (Many thanks to Sean Owen of Cloudera for pointing this out.)
2. Use 'SparkContext.binaryFiles' to read the entire binary file as a single record. This will impact performance, since a single record cannot be split across more than one mapper.
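Either way, the fix only works if the external converter receives the raw bytes unmodified on its stdin. A minimal stand-in, without Spark, for what '/usr/bin/binary-to-csv.sh' would see under option 2 (the converter here is a hypothetical Python one-liner that emits one decimal value per byte, not a real pcap tool):

```python
import subprocess
import sys

# Whole-file payload, as 'SparkContext.binaryFiles' would deliver it:
# one (path, bytes) record per file.
data = bytes(range(256))

# Stand-in for /usr/bin/binary-to-csv.sh: reads raw bytes from stdin and
# writes a comma-separated line of decimal byte values. (Hypothetical
# converter, for illustration only.)
converter = [
    sys.executable, "-c",
    "import sys; d = sys.stdin.buffer.read(); "
    "print(','.join(str(b) for b in d))",
]

# Feed the bytes through the pipe untouched -- no text decoding involved.
result = subprocess.run(converter, input=data, capture_output=True, check=True)
fields = result.stdout.decode("ascii").strip().split(",")
assert len(fields) == len(data)  # every byte survived the pipe
```

Unlike the 'textFile' path, every one of the 256 possible byte values makes it to the converter intact.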
In my specific case, for #1 I can find only one project, from RIPE-NCC, that does this. Unfortunately, it appears to support only a limited set of network protocols.