
Map Reduce output to CSV or do I need Key Values?

My map function produces:

Key\tValue

where Value = List(value1, value2, value3)

My reduce function then produces:

Key\tCSV-Line

Ex.


2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s,

2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s


Ex. RawData: 232342|@3423@|34343|sfasdfasdF|433443|Sfasfdas|324343 x 1000

Anyway, I want to eliminate the keys at the beginning of each line so my client can do a straight import into MySQL. I have about 50 data files. My question is: once the map phase is done and the reducer starts, does the reducer need to emit the key along with the value, or can I just emit the value?


More information:

This code might shed some light on the situation:

http://pastebin.ca/2410217

This is roughly what I plan to do.

Asked Jun 26 '13 by Jake Steele


People also ask

Where does the output of MapReduce get stored?

All inputs and outputs are stored in the HDFS. While the map is a mandatory step to filter and sort the initial data, the reduce function is optional.

What is the use of key value pair in MapReduce?

Hadoop MapReduce uses key-value pairs to process data efficiently. The MapReduce concept is derived from Google's white papers, which use this concept. Key-value pairs are not part of the input data as such; rather, the input data is split into keys and values to be processed in the mapper.

Which is optional in MapReduce?

A Combiner, also known as a semi-reducer, is an optional class that operates by accepting the inputs from the Map class and thereafter passing the output key-value pairs to the Reducer class.
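
As a sketch of how that optional step is wired in (assuming a standard WordCount-style job, with `TokenizerMapper` and `IntSumReducer` as hypothetical class names), the combiner is set on the driver side; the reducer class can double as the combiner here because integer summation is associative and commutative:

```java
// Hypothetical driver fragment: reuse the sum reducer as a combiner.
// This is only safe when the reduce operation is associative and
// commutative (e.g. summing counts), since the combiner may run
// zero, one, or many times on partial map output.
Job job = Job.getInstance(new Configuration(), "word count");
job.setMapperClass(TokenizerMapper.class);   // assumed mapper class
job.setCombinerClass(IntSumReducer.class);   // the optional semi-reducer
job.setReducerClass(IntSumReducer.class);
```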

What are the key and pair of MapReduce?

The key-value pair in MapReduce is the record entity that Hadoop MapReduce accepts for execution. We use Hadoop mainly for data analysis; it deals with structured, unstructured, and semi-structured data. With Hadoop, if the schema is static we can work directly on columns instead of key-value pairs.


2 Answers

If you do not want to emit the key, set it to NullWritable in your code. For example:

public static class TokenCounterReducer extends
        Reducer<Text, IntWritable, NullWritable, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit only the value; NullWritable suppresses the key entirely.
        context.write(NullWritable.get(), new IntWritable(sum));
        // context.write(key, new IntWritable(sum));
    }
}

Let me know if this is not what you need and I'll update the answer accordingly.
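
For this to work end to end, the driver also has to declare the matching output types; a sketch, assuming a standard `Job` setup (`job` is a hypothetical variable name here):

```java
// Hypothetical driver fragment: the declared output key type must
// match what the reducer actually writes -- NullWritable, not Text.
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(IntWritable.class);
```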

Answered Sep 22 '22 by Tariq


Your reducer can emit a line without the \t, i.e. just what you're calling the value. Unfortunately, Hadoop Streaming will interpret this as a key with a null value and automatically append a delimiter (\t by default) to the end of each line. You can change what this delimiter is, but when I played around with this I could not get it to append no delimiter at all. I don't remember the exact details, but based on this (Hadoop: key and value are tab separated in the output file. how to do it semicolon-separated?) I think the property is mapred.textoutputformat.separator. My solution was to strip the \t at the end of each line as I pulled the file back:

hadoop fs -cat hadoopfile | perl -pe 's/\t$//' > destfile
Answered Sep 21 '22 by John Pickard