My map function produces:
Key\tValue
where Value = List(value1, value2, value3)
Then my reduce function produces:
Key\tCSV-Line
For example:
2323232-2322 fdsfs,sdfs,dfsfs,0,0,0,2,fsda,3,23,3,s,
2323555-22222 dfasd,sdfas,adfs,0,0,2,0,fasafa,2,23,s
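For context, here is a minimal sketch of what such a reducer might look like, assuming Text values; the class name is hypothetical and this is not my actual code (see the pastebin link below):
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: joins the grouped values for a key into one CSV line.
public static class CsvLineReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder csv = new StringBuilder();
        for (Text value : values) {
            if (csv.length() > 0) {
                csv.append(',');
            }
            csv.append(value.toString());
        }
        // Currently emits "Key\tCSV-Line"; the question below is whether
        // the key part can be dropped.
        context.write(key, new Text(csv.toString()));
    }
}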
Example raw data:
232342|@3423@|34343|sfasdfasdF|433443|Sfasfdas|324343
x 1000
Anyway, I want to eliminate the keys at the beginning of that output so my client can do a straight import into MySQL. I have about 50 data files. My question is: once the map phase is done and the reducer starts, does the reducer need to print the key out with the value, or can I just print the value?
More information:
This code might shed some better light on the situation: http://pastebin.ca/2410217
This is roughly what I plan to do.
All inputs and outputs are stored in HDFS. While a map step is mandatory (it filters and transforms the initial data, whose output the framework then sorts), the reduce step is optional.
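To illustrate that last point, here is a hedged sketch of a reduce-less (map-only) job: setting the number of reduce tasks to zero makes the framework write map output straight to HDFS. The driver class name is an assumption; the base Mapper class is used as an identity pass-through.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only example");
        job.setJarByClass(MapOnlyDriver.class);
        job.setMapperClass(Mapper.class);  // base Mapper = identity pass-through
        job.setNumReduceTasks(0);          // zero reducers: map output goes straight to HDFS
        job.setOutputKeyClass(LongWritable.class);  // TextInputFormat's key type
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}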
Hadoop MapReduce uses key-value pairs to process data efficiently; the MapReduce concept itself derives from Google's white papers. Key-value pairs are not part of the input data as such; rather, the input is split into keys and values before being handed to the mapper.
A Combiner, also known as a semi-reducer, is an optional class that accepts the outputs of the Map class and passes its own output key-value pairs on to the Reducer class.
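As a hedged illustration, here is a WordCount-style combiner; the class name is hypothetical:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical combiner: pre-sums counts on each mapper's local output.
// Its output types must match the map output types, because the framework
// may run it zero, one, or many times between map and reduce.
public static class IntSumCombiner
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}

// Wiring in the driver:
// job.setCombinerClass(IntSumCombiner.class);
Note that a reducer which changes the key type, like the NullWritable example further down this page, cannot double as a combiner, since a combiner's output must still look like map output.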
The key-value pair is the record unit that Hadoop MapReduce accepts for execution. Hadoop is used mainly for data analysis and deals with structured, unstructured, and semi-structured data. If the schema is static, we can work directly on the columns instead of on generic key-value pairs.
If you do not want to emit the key, set it to NullWritable in your code. For example:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class TokenCounterReducer
        extends Reducer<Text, IntWritable, NullWritable, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit only the value; NullWritable suppresses the key entirely.
        context.write(NullWritable.get(), new IntWritable(sum));
        // context.write(key, new IntWritable(sum));
    }
}
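One hedged follow-up, assuming the standard Job driver API: the job's declared output types must match what the reducer now emits, or the job will fail with a type mismatch at runtime.
// In the driver, mirror the reducer's new output types.
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(IntWritable.class);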
Let me know if this is not what you need; I'll update the answer accordingly.
Your reducer can emit a line without a \t, or, in your case, just what you're calling the value. Unfortunately, Hadoop Streaming will interpret this as a key with a null value and automatically append a delimiter (\t by default) to the end of each line. You can change what this delimiter is, but when I played around with this I could not get it to not append a delimiter at all. I don't remember the exact details, but based on this (Hadoop: key and value are tab separated in the output file. how to do it semicolon-separated?) I think the property is mapred.textoutputformat.separator. My solution was to strip the \t at the end of each line as I pulled the file back:
hadoop fs -cat hadoopfile | perl -pe 's/\t$//' > destfile
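For a plain Java (non-streaming) job, here is a hedged sketch of setting that separator on the job configuration; the property name changed across Hadoop versions, so treat both names below as assumptions to verify against your release:
// Hedged: older releases read "mapred.textoutputformat.separator",
// newer ones "mapreduce.output.textoutputformat.separator".
Configuration conf = job.getConfiguration();
conf.set("mapred.textoutputformat.separator", ";");
conf.set("mapreduce.output.textoutputformat.separator", ";");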