I know that tab is the default field separator for these properties:
stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator
But when I set the generic parser option as either
stream.map.output.field.separator=\t (or)
stream.map.output.field.separator="\t"
to test how Hadoop parses whitespace characters such as \t, \n, and \f when they are used as separators, I observed that Hadoop reads the value as the literal characters \ and t rather than as an actual tab character. I checked this by printing each line in my Python reducer as it is read, using:
sys.stdout.write(str(line))
My mapper emits key/value pairs as: key value1 value2
using the call print(key, value1, value2, sep='\t', end='\n').
So I expected my reducer to read each line as: key value1 value2
as well, but instead sys.stdout.write(str(line))
printed:
key value1 value2 (with a trailing space)
From the question "Hadoop streaming - remove trailing tab from reducer output", I understood that the trailing space appears because mapreduce.textoutputformat.separator
was not set and was left at its default.
This seemed to confirm my assumption that Hadoop treated my entire map output:
key value1 value2
as the key, with an empty Text object as the value, because it read the separator from stream.map.output.field.separator=\t
as the literal characters \ and t rather than as an actual tab character.
Please help me understand this behavior, and how I can use a tab as the separator explicitly if I want to.
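The hypothesis in the question can be checked outside Hadoop with a small simulation. This is only a sketch: split_key_value is a hypothetical helper that mimics "split at the first separator, otherwise the whole line is the key", and it is not Hadoop's actual code. It assumes the configured separator arrives as the two characters backslash and t.

```python
def split_key_value(line, separator):
    """Hypothetical stand-in for a key/value split: cut at the first
    occurrence of separator; if it is absent, the whole line is the key."""
    key, _found, value = line.partition(separator)
    return key, value


REAL_TAB = "\t"   # one character (code point 9)
LITERAL = r"\t"   # two characters: a backslash followed by 't'

# What the mapper in the question writes: fields joined by real tabs.
map_output = "key\tvalue1\tvalue2"

print(len(REAL_TAB), len(LITERAL))            # 1 2

# With a real tab the line splits after the first field.
print(split_key_value(map_output, REAL_TAB))  # ('key', 'value1\tvalue2')

# With the two-character literal nothing matches, so the value is empty.
key, value = split_key_value(map_output, LITERAL)
print((key, value))                           # ('key\tvalue1\tvalue2', '')

# Writing key + tab + value then leaves a trailing tab on the line.
# repr() makes the otherwise invisible character visible.
print(repr(key + "\t" + value))               # 'key\tvalue1\tvalue2\t'
```

In the reducer itself, writing repr(line) instead of str(line) is a quick way to see whether the trailing character is a tab or a space.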
GenericOptionsParser is a utility to parse command line arguments generic to the Hadoop framework. GenericOptionsParser recognizes several standard command line arguments, enabling applications to easily specify a namenode, a jobtracker, additional configuration resources etc.
The Hadoop generic command options you can use with streaming are listed here: You can specify additional configuration variables by using “-D <property>=<value>”. To change the local temp directory use: To specify additional local temp directories use: -D mapred.local.dir=/tmp/local -D mapred.system.dir=/tmp/system -D mapred.temp.dir=/tmp/temp
Create a GenericOptionsParser to parse only the generic Hadoop arguments. Create an options parser with the given options to parse the args. Create an options parser to parse the args. Returns the commons-cli CommandLine object to process the parsed arguments.
To be backward compatible, Hadoop Streaming also supports the “-reducer NONE” option, which is equivalent to “-D mapreduce.job.reduces=0”. To specify the number of reducers, for example two, use:
mapred streaming \
  -D mapreduce.job.reduces=2 \
  -input myInputDirs \
  -output myOutputDir \
  -mapper /bin/cat \
  -reducer /usr/bin/wc
You might be running into the behavior described in the Hadoop Streaming documentation: "-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and (with stream.num.map.output.key.fields=4, as in that documentation example) the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has fewer than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")). This describes both how the separator is used and how many occurrences of it are considered when identifying the map key and value. There are also properties related to partitioning, which determine which reducer each key is sent to. Since you want to change the separator, I think you should also check the settings related to partitioning and the reducer.
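The rule quoted from the documentation can be illustrated with a short Python sketch. The function split_at_nth and its parameter names are made up for this example, and it only imitates the described behavior (key is the prefix up to the n-th separator, value is the rest); it is not how Hadoop Streaming is implemented.

```python
def split_at_nth(line, separator, num_key_fields):
    """Return (key, value), where key is the prefix of line up to the
    num_key_fields-th occurrence of separator and value is the rest.
    If there are fewer occurrences, the whole line is the key and the
    value is empty, as the quoted documentation describes."""
    if num_key_fields < 1:
        raise ValueError("num_key_fields must be at least 1")
    start = 0
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            return line, ""          # not enough separators
        start = pos + len(separator)
    return line[:pos], line[start:]


# Four key fields separated by ".", as in the documentation example.
print(split_at_nth("a.b.c.d.e.f", ".", 4))   # ('a.b.c.d', 'e.f')

# Fewer than four "."s: the whole line becomes the key.
print(split_at_nth("a.b.c", ".", 4))         # ('a.b.c', '')

# A tab-separated line, one key field, real tab as the separator.
print(split_at_nth("key\tvalue1\tvalue2", "\t", 1))
```

The last call returns ('key', 'value1\tvalue2'), which is the split the question expected from a tab separator.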