I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program.
This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark?
bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")
Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it.
The problem is not my use of 'pipe', but that 'textFile' cannot be used to read binary data: it treats the input as newline-delimited UTF-8 text, which corrupts arbitrary bytes. (Doh) There are a couple of options for moving forward.
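The corruption is easy to reproduce outside Spark. A pure-Python sketch (the byte values are arbitrary, chosen only to include undecodable bytes and a newline sequence): decoding binary as UTF-8 and splitting it into lines cannot be round-tripped.

```python
# Simulate reading binary data as newline-delimited UTF-8 text,
# which is what 'textFile' effectively does.
data = bytes([0x00, 0xff, 0x0d, 0x0a, 0x80, 0x41])

# Bytes that are not valid UTF-8 (0xff, 0x80) are replaced with U+FFFD...
text = data.decode("utf-8", errors="replace")

# ...and the newline bytes (\r\n) are consumed by line splitting.
lines = text.splitlines()

# Rejoining the "records" does not reproduce the original bytes.
recovered = "\n".join(lines).encode("utf-8")
assert recovered != data
```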
1. Implement a custom 'InputFormat' that understands the binary input data. (Many thanks to Sean Owen of Cloudera for pointing this out.)
2. Use 'SparkContext.binaryFiles' to read the entire binary file as a single record. This will impact performance, since a single record cannot be split across more than one mapper.
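Either way, the fix only works if the external converter receives the raw bytes unmodified on its stdin. A minimal stand-in, without Spark, for what '/usr/bin/binary-to-csv.sh' would see under option 2 (the converter here is a hypothetical Python one-liner that emits one decimal value per byte, not a real pcap tool):

```python
import subprocess
import sys

# Whole-file payload, as 'SparkContext.binaryFiles' would deliver it:
# one (path, bytes) record per file.
data = bytes(range(256))

# Stand-in for /usr/bin/binary-to-csv.sh: reads raw bytes from stdin and
# writes a comma-separated line of decimal byte values. (Hypothetical
# converter, for illustration only.)
converter = [
    sys.executable, "-c",
    "import sys; d = sys.stdin.buffer.read(); "
    "print(','.join(str(b) for b in d))",
]

# Feed the bytes through the pipe untouched -- no text decoding involved.
result = subprocess.run(converter, input=data, capture_output=True, check=True)
fields = result.stdout.decode("ascii").strip().split(",")
assert len(fields) == len(data)  # every byte survived the pipe
```

Unlike the 'textFile' path, every one of the 256 possible byte values makes it to the converter intact.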
In my specific case, for #1 I can find only one project, from RIPE-NCC, that does this. Unfortunately, it appears to support only a limited set of network protocols.