hadoop 2.4.0 streaming generic parser options using TAB as separator

Tags:

python

utf-8

hadoop

mapreduce

hadoop-streaming

I know that tab is the default separator for the following fields:

stream.map.output.field.separator
stream.reduce.input.field.separator
stream.reduce.output.field.separator
mapreduce.textoutputformat.separator

But I tried passing the generic parser option as:

stream.map.output.field.separator=\t (or)  
stream.map.output.field.separator="\t"

to test how Hadoop parses whitespace characters such as "\t", "\n" and "\f" when they are used as separators. I observed that Hadoop reads the value as the literal characters \t (a backslash followed by t) rather than as an actual tab character. I checked this by printing each line in the (Python) reducer as it is read, using:

sys.stdout.write(str(line))

My mapper emits tab-separated key/value pairs of the form: key value1 value2

using the statement print(key, value1, value2, sep='\t', end='\n').

So I expected my reducer to read each line as: key value1 value2 as well, but instead sys.stdout.write(str(line)) printed:

key value1 value2 \\with trailing space
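Since str(line) prints a tab and a space the same way, writing repr(line) instead makes the trailing character visible. Here is a minimal sketch of that check. The helper name show_lines and the sample input are made up for illustration, and the sample assumes the trailing character is a tab:

```python
import io


def show_lines(stream, out):
    """Write repr() of every input line so tabs and trailing whitespace show up."""
    for line in stream:
        out.write(repr(line) + "\n")


# Simulated reducer input (an assumption): the mapper line followed by a
# trailing tab, standing in for what the reducer would read from stdin.
sample = io.StringIO("key\tvalue1\tvalue2\t\n")
captured = io.StringIO()
show_lines(sample, captured)

debug_output = captured.getvalue()
print(debug_output, end="")
```

In an actual reducer this would be called as show_lines(sys.stdin, sys.stderr), so the debug text does not mix with the reducer's real output.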

From the question "Hadoop streaming - remove trailing tab from reducer output", I understood that the trailing whitespace appears because mapreduce.textoutputformat.separator is not set and keeps its default value.

So this confirmed my assumption that Hadoop treated my entire map output:

key value1 value2

as the key, with an empty Text object as the value, because it read the separator from stream.map.output.field.separator=\t as the literal two-character string "\t" instead of an actual tab character.
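To illustrate that reasoning, here is a small Python model of the split. It is only a sketch of my assumption, not Hadoop's actual code: first_split is a made-up helper, and rejoining key and value with a tab is assumed from the behavior described above.

```python
def first_split(line, separator):
    """Model of a key/value split at the first occurrence of separator.

    str.partition returns the whole line and an empty remainder when the
    separator does not occur, which mirrors the "whole line is the key" case.
    """
    key, _found, value = line.partition(separator)
    return key, value


map_output = "key\tvalue1\tvalue2"  # what print(..., sep='\t') emits, minus the newline

real_tab = "\t"   # a single TAB character
literal = "\\t"   # two characters: a backslash followed by 't'

key_ok, value_ok = first_split(map_output, real_tab)
key_bad, value_bad = first_split(map_output, literal)

# Assumption: key and value are joined with a tab again before the reducer sees them.
reducer_line = key_bad + "\t" + value_bad

print(len(real_tab), len(literal))
print(repr((key_ok, value_ok)))
print(repr((key_bad, value_bad)))
print(repr(reducer_line))
```

With the real tab the key is just "key". With the two-character string the separator never occurs in the line, so the whole line becomes the key, the value is empty, and the rejoined line ends in a tab.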

Please help me understand this behavior, and how I can use \t as a separator if I want to.

asked May 27 '15 18:05

annunarcist

1 Answer

You might be running into the behavior that the Hadoop streaming documentation describes:

"-D stream.map.output.field.separator=." specifies "." as the field separator for the map outputs, and the prefix up to the fourth "." in a line will be the key and the rest of the line (excluding the fourth ".") will be the value. If a line has less than four "."s, then the whole line will be the key and the value will be an empty Text object (like the one created by new Text("")).

The "fourth" in that example comes from the accompanying option -D stream.num.map.output.key.fields=4. The passage spells out how the separator is used and how many occurrences of it are counted when the map key and value are identified.

There are also options related to partitioning, which decide which reducer receives each key. Since you want to change the separator, I think you should check those partitioning and reducer options as well.
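The rule quoted above can be sketched in Python as follows. This is only an illustration of the described behavior, not Hadoop's implementation. The function name split_at_nth and its parameters are made up, and num_key_fields stands in for the number of key fields (four in the quoted example).

```python
def split_at_nth(line, separator, num_key_fields):
    """Split line into (key, value) at the num_key_fields-th separator.

    If the line contains fewer separators than that, the whole line is the
    key and the value is the empty string.
    """
    if num_key_fields < 1:
        raise ValueError("num_key_fields must be at least 1")
    start = 0
    pos = -1
    for _ in range(num_key_fields):
        pos = line.find(separator, start)
        if pos == -1:
            return line, ""           # not enough separators: empty value
        start = pos + len(separator)  # continue searching after this match
    return line[:pos], line[start:]


full = split_at_nth("a.b.c.d.e.f", ".", 4)   # enough separators
short = split_at_nth("a.b.c", ".", 4)        # fewer than four separators
print(full)
print(short)
```

The second call shows the case from the question: when the configured separator does not occur often enough (or at all), the whole line ends up as the key.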

answered Sep 22 '22 21:09

Ramzy
