Within my mapper I'd like to call external software installed on the worker node, outside of HDFS. Is this possible, and what is the best way to do it?
I understand that this may take away some of the advantages/scalability of MapReduce, but I'd like both to interact with HDFS and to call compiled/installed external software from within my mapper to process some data.
RecordWriter in Hadoop MapReduce: as we know, the Reducer takes the Mappers' intermediate output as its input and runs a reduce function on it to produce output that is again zero or more key-value pairs. The RecordWriter then writes these output key-value pairs from the Reducer phase to the output files.
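To make the role of RecordWriter concrete, here is a rough sketch of a custom RecordWriter that writes each key-value pair as a tab-separated line. The class name and the tab-separated format are assumptions made for illustration, not anything from the question:

```java
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Minimal illustrative RecordWriter: writes "key<TAB>value" lines to an output stream
public class TabSeparatedRecordWriter extends RecordWriter<Text, IntWritable> {
    private final DataOutputStream out;

    public TabSeparatedRecordWriter(DataOutputStream out) {
        this.out = out;
    }

    @Override
    public void write(Text key, IntWritable value) throws IOException {
        out.writeBytes(key.toString() + "\t" + value.get() + "\n");
    }

    @Override
    public void close(TaskAttemptContext context) throws IOException {
        out.close();
    }
}
```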
MapReduce assigns fragments of data across the nodes in a Hadoop cluster. The goal is to split a dataset into chunks and use an algorithm to process those chunks at the same time. The parallel processing on multiple machines greatly increases the speed of handling even petabytes of data.
A MapReduce program executes in three stages: the map stage, the shuffle stage, and the reduce stage. In the map stage, the mapper's job is to process the input data. Generally the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS).
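As a quick reminder of how those stages map onto code, here is a standard word-count style sketch (the class and field names are just placeholders):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit (word, 1) for every token in the input line
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reduce stage: after the shuffle groups values by key, sum the counts per word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```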
Mappers (and reducers) are like any other process on the box: as long as the TaskTracker user has permission to run the executable, there is no problem doing so. There are a few ways to call external processes, but since we are already in Java, ProcessBuilder seems a logical place to start.
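For example, a mapper could pipe each record through an external binary roughly like this. The path "/usr/local/bin/mytool" is a placeholder for whatever is installed on your worker nodes, and the sketch assumes the tool reads stdin and writes stdout:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch of a mapper that runs an external program once per input record
public class ExternalToolMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder("/usr/local/bin/mytool");
        pb.redirectErrorStream(true);   // merge stderr into stdout
        Process proc = pb.start();

        // Feed the record to the external program's stdin
        proc.getOutputStream().write(value.toString().getBytes(StandardCharsets.UTF_8));
        proc.getOutputStream().close();

        // Collect the program's stdout to emit as the map output value
        StringBuilder out = new StringBuilder();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.append(line).append('\n');
            }
        }

        int exitCode = proc.waitFor();
        if (exitCode != 0) {
            throw new IOException("mytool exited with code " + exitCode);
        }
        context.write(value, new Text(out.toString()));
    }
}
```

Starting one process per record can be expensive; for heavy tools you may want to start the process once per map task (e.g. in setup()) and stream records through it, but that is a design choice that depends on your tool.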
EDIT: Just found that Hadoop has a class explicitly for this purpose: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Shell.html
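Using that class, the same call could look roughly like this (the command and argument are again placeholders, and this is only a sketch of how ShellCommandExecutor is typically used):

```java
import java.io.IOException;
import org.apache.hadoop.util.Shell.ShellCommandExecutor;

// Run an external command through Hadoop's Shell utility and capture its stdout
public class ShellExample {
    public static String runTool(String inputPath) throws IOException {
        ShellCommandExecutor exec =
            new ShellCommandExecutor(new String[] { "/usr/local/bin/mytool", inputPath });
        exec.execute();          // runs the command and waits for it to finish
        return exec.getOutput(); // captured stdout of the command
    }
}
```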