
hadoop 2.4.0 streaming generic parser options using TAB as separator

I know that tab is the default separator for these fields:

stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator

but if I try to set the generic parser option:

stream.map.output.field.separator=\t (or)
stream.map.output.field.separator="\t"

to test how Hadoop parses whitespace characters such as \t, \n and \f when they are used as separators, Hadoop reads the value as the two literal characters \t rather than as an actual tab. I checked this by printing each line in the reducer (Python) as it is read, using:

sys.stdout.write(str(line))

My mapper emits key/value pairs as: key value1 value2

using the call print(key, value1, value2, sep='\t', end='\n').

So I expected my reducer to read each line as key value1 value2 as well, but instead sys.stdout.write(str(line)) printed:

key value1 value2 \\with trailing space

From the question "Hadoop streaming - remove trailing tab from reducer output", I understood that the trailing space appears because mapreduce.textoutputformat.separator is not set and is left at its default.

So this confirmed my assumption that Hadoop treated my whole map output:

key value1 value2

as the key, with an empty Text object as the value, because it read the separator from stream.map.output.field.separator=\t as the two literal characters \t instead of an actual tab.
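A minimal sketch of this suspicion in Python. It assumes that streaming splits each map output line at the first occurrence of the configured separator string and then rejoins key and value with a tab. The helper split_first is hypothetical and only models the behavior; it is not Hadoop code:

```python
def split_first(line, separator):
    """Split a line into (key, value) at the first occurrence of separator.

    If the separator does not occur, the whole line becomes the key and
    the value is empty (a simplified model, not Hadoop's implementation).
    """
    key, _, value = line.partition(separator)
    return key, value


mapper_line = "key\tvalue1\tvalue2"

real_tab = "\t"      # one character: an actual tab
literal_bs_t = "\\t"  # two characters: a backslash followed by 't'

# With a real tab, the first field becomes the key.
print(split_first(mapper_line, real_tab))

# With the literal two-character string, no match is found,
# so the whole line is the key and the value is empty.
key, value = split_first(mapper_line, literal_bs_t)
print((key, value))

# Assumption: key and value are rejoined with a tab before reaching the
# reducer, which would leave a trailing tab after an empty value.
reducer_line = key + "\t" + value
print(repr(reducer_line))
```

Under these assumptions the last line shows a trailing tab, which would match the output I saw in the reducer.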

Please help me understand this behavior, and how I can use \t as a separator if I want to.

asked May 27 '15 at 18:05 by annunarcist




1 Answer

You might be running into the behavior described in the Hadoop streaming documentation:

"-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has less than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).

This describes how the separator is used and how many separator occurrences are considered when a line is split into map key and value. In the documentation's example the count is four because stream.num.map.output.key.fields is set to 4 there. There are also settings related to partitioning, which determine which reducer each key is sent to. Since you want to change the separator, I think you should check those partitioning and reducer settings as well.
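The rule quoted from the documentation can be sketched in Python as follows. The function split_at_nth is a hypothetical illustration of the described behavior (split at the n-th separator, fall back to an empty value), not the code Hadoop actually runs:

```python
def split_at_nth(line, separator, num_key_fields):
    """Return (key, value), splitting at the num_key_fields-th separator.

    If the line contains fewer separators than requested, the whole line
    is the key and the value is empty, as the documentation describes.
    """
    if num_key_fields < 1:
        raise ValueError("num_key_fields must be at least 1")
    start = 0
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            # Not enough separators: whole line is the key.
            return line, ""
        start = pos + len(separator)
    # Key is everything before the n-th separator; value is everything after it.
    return line[:pos], line[start:]


# Six fields, split at the fourth ".": key is "a.b.c.d", value is "e.f".
print(split_at_nth("a.b.c.d.e.f", ".", 4))

# Only two "."s, so the whole line is the key and the value is empty.
print(split_at_nth("a.b.c", ".", 4))
```

If the configured separator never matches (for example, because it was read as a backslash followed by t), every line falls into the second case.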

answered Sep 22 '22 at 21:09 by Ramzy
