 

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?

The process is a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. Since I need to integrate this application's flow into something bigger (also in Spark), I was wondering if there is a way to do this.

The process can easily be run in parallel (at the moment I use GNU Parallel): just split its input into (for example) 10 part files, run 10 instances of it in memory, and re-join the 10 output part files into one file.

asked Dec 04 '15 by Randomize
1 Answer

The simplest thing you can do is write a small wrapper which takes data from standard input, writes it to a file, executes the external program, and sends the results to standard output. After that, all you have to do is use the pipe method:

rdd.pipe("your_wrapper")
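
For illustration, here is a minimal sketch of the driver side in Scala; the paths and the wrapper name your_wrapper.sh are placeholders, and the wrapper itself is assumed to dump stdin to a temporary file, run the C/C++ binary on it, and print the resulting file to stdout:

import org.apache.spark.{SparkConf, SparkContext}

object PipeExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("pipe-example"))

    // Each line of the text file becomes one RDD element.
    val input = sc.textFile("hdfs:///path/to/input.txt")

    // Spark streams each partition's elements to the wrapper's stdin
    // (one element per line) and builds the output RDD from whatever
    // the wrapper prints to stdout.
    val output = input.pipe("/path/to/your_wrapper.sh")

    output.saveAsTextFile("hdfs:///path/to/output")
    sc.stop()
  }
}

Since pipe runs once per partition, this gives you essentially the same split-run-join pattern you currently get from GNU Parallel, with the number of partitions playing the role of the number of parallel instances.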

The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly, without going through disk.

Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output, as in the sketch below.
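
A rough sketch of that variant in Scala, assuming an RDD[String] named rdd and assuming the binary takes an input path and an output path as its two arguments (the binary path and argument order are placeholders):

import java.io.{File, PrintWriter}
import scala.io.Source
import scala.sys.process._

val output = rdd.mapPartitionsWithIndex { (idx, iter) =>
  // Write this partition to a local temporary file.
  val in  = File.createTempFile(s"part-$idx-", ".in")
  val out = File.createTempFile(s"part-$idx-", ".out")
  val writer = new PrintWriter(in)
  try iter.foreach(line => writer.println(line)) finally writer.close()

  // Call the external program synchronously; `!` returns its exit code.
  val exit = Seq("/path/to/your_binary", in.getAbsolutePath, out.getAbsolutePath).!
  require(exit == 0, s"external process failed with exit code $exit")

  // Read the result back before cleaning up the temporary files.
  val src = Source.fromFile(out)
  val result = try src.getLines().toList finally src.close()
  in.delete(); out.delete()
  result.iterator
}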

answered Oct 08 '22 by zero323