
How to 'Pipe' Binary Data in Apache Spark

Tags:

apache-spark

I have an RDD containing binary data. I would like to use 'RDD.pipe' to pipe that binary data to an external program that will translate it to string/text data. Unfortunately, it seems that Spark is mangling the binary data before it gets passed to the external program.

This code is representative of what I am trying to do. What am I doing wrong? How can I pipe binary data in Spark?

bin = sc.textFile("binary-data.dat")
csv = bin.pipe("/usr/bin/binary-to-csv.sh")
csv.saveAsTextFile("text-data.csv")

Specifically, I am trying to use Spark to transform pcap (packet capture) data to text/csv so that I can perform an analysis on it.

Asked Jan 16 '15 by Nick Allen


1 Answer

The problem was not my use of 'pipe', but that 'textFile' cannot be used to read in binary data: it decodes its input as newline-delimited text, which mangles arbitrary bytes. (Doh) There are a couple of options to move forward.

  1. Implement a custom 'InputFormat' that understands the binary input data. (Many thanks to Sean Owen of Cloudera for pointing this out.)

  2. Use 'SparkContext.binaryFiles' to read each binary file in as a single (path, bytes) record. This can hurt performance, as it prevents using more than one mapper on a file's data.

For option #1, in my specific case the only project I can find that does this comes from RIPE-NCC. Unfortunately, it appears to support only a limited set of network protocols.
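A minimal sketch of option #2, assuming Python/PySpark: read whole files as raw bytes with 'binaryFiles', then feed each file's bytes to the external converter yourself via subprocess, instead of 'RDD.pipe' (which is line-oriented and mangles binary input). The converter command below is a hypothetical stand-in (a small Python script that hex-dumps stdin); in the question it would be /usr/bin/binary-to-csv.sh.

```python
import subprocess
import sys

# Stand-in for an external binary-to-text converter: reads raw bytes on
# stdin and writes one hex byte per line on stdout. Swap in the real
# converter command, e.g. ["/usr/bin/binary-to-csv.sh"].
CONVERTER = [sys.executable, "-c",
             "import sys; d = sys.stdin.buffer.read(); "
             "sys.stdout.write('\\n'.join(format(b, '02x') for b in d))"]

def binary_to_lines(raw: bytes) -> list:
    """Pipe one file's raw bytes through the external converter and
    return its text output as a list of lines."""
    result = subprocess.run(CONVERTER, input=raw,
                            capture_output=True, check=True)
    return result.stdout.decode().splitlines()

# With Spark, this would be applied per (path, bytes) record, e.g.:
#   csv = sc.binaryFiles("binary-data.dat") \
#           .flatMap(lambda kv: binary_to_lines(kv[1]))
#   csv.saveAsTextFile("text-data.csv")

print(binary_to_lines(b"\x00\xff"))  # → ['00', 'ff']
```

Because the bytes never pass through Spark's text decoding, nothing gets mangled; the trade-off is that each file is processed as one record.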

Answered Oct 03 '22 by Nick Allen