Let's say I have a file of several billion lines, 500 GB to 1 TB in size. How can I produce a new file with the same lines, but with the lines randomly shuffled? The shuffle should be completely random, if that can be achieved.
Create a mapper that keys each line with a random GUID. The following Hadoop mapper illustrates the logic:
import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit each line keyed by a random UUID; the shuffle/sort phase then orders lines by these random keys.
public class ShuffleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(new Text(UUID.randomUUID().toString()), value);
    }
}
In the reducer, you simply collect the lines (the values). This can be done with a single reducer, or, if you run into resource issues (e.g. a local disk filling up), you can split the work across multiple reducers and then concatenate their output files from the command line. A sketch of such a reducer follows below.
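For completeness, here is a minimal reducer along those lines; the class name ShuffleReducer is illustrative. It drops the random UUID key and writes out only the original lines.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Discard the random UUID key and emit only the original lines.
public class ShuffleReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text line : values) {
            context.write(NullWritable.get(), line);
        }
    }
}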
Note: This doesn't necessarily give an unbiased shuffle the way Fisher-Yates does, but this solution is easier to implement and quite fast.