Spark: run an external process in parallel

Question

Is it possible to use Spark to "wrap" and run an external process, managing its input and output?

The process is a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. As I need to integrate the flow of this application into something bigger (also in Spark), I was wondering if there is a way to do this.

The process can easily be run in parallel (at the moment I use GNU Parallel) by splitting its input into, for example, 10 part files, running 10 instances of it in memory, and re-joining the 10 output part files into one file.

zero323 · Accepted Answer

The simplest thing you can do is to write a small wrapper that takes data from standard input, writes it to a file, executes the external program, and writes the results to standard output. After that, all you have to do is use the pipe method:

rdd.pipe("your_wrapper")
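A minimal sketch of such a wrapper, written in Python purely for illustration (any executable language would do). It assumes the external program is invoked as `program <input_path> <output_path>`; that calling convention, the file name `wrapper.py`, and the helper name `run_wrapped` are hypothetical and should be adapted to the real command line.

```python
#!/usr/bin/env python3
"""Hypothetical wrapper for rdd.pipe: stdin -> temp file -> program -> stdout."""
import os
import subprocess
import sys
import tempfile


def run_wrapped(command, lines):
    """Write lines to a temp file, run `command <input> <output>`, return output lines.

    Assumption: the external program takes an input path and an output path
    as its last two positional arguments.
    """
    with tempfile.TemporaryDirectory() as workdir:
        in_path = os.path.join(workdir, "input.txt")
        out_path = os.path.join(workdir, "output.txt")
        with open(in_path, "w") as f:
            for line in lines:
                f.write(line.rstrip("\n") + "\n")
        # check=True raises if the program exits with a non-zero status.
        # The program's own stdout is discarded so it cannot mix with the results.
        subprocess.run(list(command) + [in_path, out_path],
                       check=True, stdout=subprocess.DEVNULL)
        with open(out_path) as f:
            return [line.rstrip("\n") for line in f]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: wrapper.py /path/to/program [extra args]; records arrive on stdin.
    for out_line in run_wrapped(sys.argv[1:], sys.stdin):
        print(out_line)
```

With this layout the call would look something like rdd.pipe("python3 wrapper.py /path/to/your_program"). Spark starts one wrapper process per partition, so the wrapper and the program both have to be available on every worker node.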

The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so that it can read and write data directly (for example through standard input and output), without going through disk.

Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output.
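The mapPartitions route could be sketched roughly as follows. This version is in Python (PySpark), with the standard subprocess module standing in for the process tools mentioned above. It again assumes the program is called as `program <input_path> <output_path>`; the name `make_partition_runner` is made up for this example.

```python
import os
import subprocess
import tempfile


def make_partition_runner(command):
    """Return a function that can be passed to rdd.mapPartitions.

    Assumption: `command` is a list such as ["/path/to/your_program"], and the
    program accepts an input path and an output path as its last two arguments.
    """
    def run_partition(records):
        # One temporary directory per partition, removed when the generator ends.
        with tempfile.TemporaryDirectory() as workdir:
            in_path = os.path.join(workdir, "input.txt")
            out_path = os.path.join(workdir, "output.txt")
            # Write the whole partition to a local file, one record per line.
            with open(in_path, "w") as f:
                for record in records:
                    f.write(str(record) + "\n")
            # Run the external program once for this partition.
            subprocess.run(list(command) + [in_path, out_path],
                           check=True, stdout=subprocess.DEVNULL)
            # Stream the program's output back as the new partition contents.
            with open(out_path) as f:
                for line in f:
                    yield line.rstrip("\n")
    return run_partition


# Intended use on a cluster (not executed here):
# result = rdd.mapPartitions(make_partition_runner(["/path/to/your_program"]))
```

Compared with pipe, this keeps everything in one job definition and avoids a separate wrapper script, but the data still goes through local disk on each worker.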

Tags:

scala

apache-spark
