I am using this for removing duplicate lines:
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DLines
{
    public static class TokenCounterMapper extends Mapper<Object, Text, Text, IntWritable>
    {
        private final static IntWritable one = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException
        {
            // Emit the whole line as the key so identical lines are grouped in the reduce phase.
            context.write(value, one);
        }
    }
    public static class TokenCounterReducer extends Reducer<Text, IntWritable, Text, IntWritable>
    {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
        {
            // Count how many times this line appeared in the input.
            int sum = 0;
            for (IntWritable value : values)
            {
                sum += value.get();
            }
            // Keep only lines that occurred exactly once.
            if (sum < 2)
            {
                context.write(key, new IntWritable(sum));
            }
        }
    }
}
I have to store only the key in HDFS.
HDFS is a distributed file system in the Hadoop ecosystem, not a key-value store like a NoSQL database; key-value pairs exist only during the MapReduce processing phase. All job inputs and outputs are stored in HDFS as plain files. The map step is mandatory for filtering and sorting the input, while the reduce step is optional.
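For example, you could make a job map-only by setting the reduce task count to zero in the driver (a fragment, assuming a Job object named job already exists; for deduplication you still need the reduce phase so identical lines are grouped at one reducer):

// With zero reduce tasks, each mapper's output is written directly to HDFS.
job.setNumReduceTasks(0);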
If you do not require a value from your reducer, just use NullWritable.
You could simply say context.write(key, NullWritable.get());
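For instance, the reducer above could be rewritten like this (a minimal sketch that keeps the same sum-based filter; note the output value type parameter changes from IntWritable to NullWritable, and it needs import org.apache.hadoop.io.NullWritable;):

    public static class TokenCounterReducer extends Reducer<Text, IntWritable, Text, NullWritable>
    {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
        {
            // Count occurrences of this line, same as before.
            int sum = 0;
            for (IntWritable value : values)
            {
                sum += value.get();
            }
            // Write only the line itself; NullWritable serializes to zero bytes.
            if (sum < 2)
            {
                context.write(key, NullWritable.get());
            }
        }
    }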
In your driver, you could also set
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
and
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NullWritable.class);
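Putting it together, a complete driver might look like this (a minimal sketch; the class name DLinesDriver, the job name, and taking the input/output paths from the command line are assumptions, not part of the original code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DLinesDriver
{
    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "remove duplicate lines");
        job.setJarByClass(DLines.class);

        job.setMapperClass(DLines.TokenCounterMapper.class);
        job.setReducerClass(DLines.TokenCounterReducer.class);

        // Map output is (line, 1).
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final output is just the line; NullWritable writes no value bytes.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

You would run it with something like hadoop jar dlines.jar DLinesDriver /input/dir /output/dir (the jar name and paths are illustrative); the surviving lines then appear as the keys in the part-r-* files of the output directory.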