Shuffle large data file(s) in mapreduce

Let's say I have a file of several billion lines and the size is 500G to 1T. How can I produce a new file with the same lines, but randomly shuffle the lines? The shuffle should be completely random if can be achieved.

Asked by Rainfield on Oct 24 '25
1 Answer

Create a mapper that emits a random GUID as the key and the line as the value. The following Hadoop mapper illustrates the logic:

import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ShuffleMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    // A random UUID key makes the framework's sort order the lines randomly.
    context.write(new Text(UUID.randomUUID().toString()), value);
  }
}

In the reducer, you just collect the lines (values). This can be done with a single reducer, or if you run into resource limits (e.g. a local disk filling up) you can spread the work across multiple reducers and then concatenate their output files from the command line.
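The end-to-end idea can be sketched without Hadoop at all: pair each line with a random key, sort by key, and keep only the lines. A minimal standalone sketch (class and method names here are illustrative, not part of the answer's code):

```java
import java.util.*;
import java.util.stream.*;

public class RandomKeyShuffle {
    // Mirrors the MapReduce job in memory: attach a random UUID key to each
    // line, sort by key, then drop the keys. The sort plays the role of
    // Hadoop's shuffle/sort phase.
    public static List<String> shuffle(List<String> lines) {
        return lines.stream()
                .map(line -> Map.entry(UUID.randomUUID().toString(), line))
                .sorted(Map.Entry.comparingByKey())
                .map(Map.Entry::getValue)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "c", "d", "e");
        List<String> out = shuffle(input);
        // The result is always a permutation of the input.
        System.out.println(out.size() == input.size() && out.containsAll(input));
    }
}
```

At cluster scale the sort is distributed across reducers, but the permutation it produces is the same in principle.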

Note: This doesn't necessarily give a perfectly unbiased shuffle the way Fisher-Yates does, but this solution is easier to implement and quite fast.
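For comparison, when a chunk of lines does fit in memory, `java.util.Collections.shuffle` already implements Fisher-Yates and gives an unbiased permutation; a quick sketch:

```java
import java.util.*;

public class FisherYatesDemo {
    public static void main(String[] args) {
        List<String> lines = new ArrayList<>(List.of("a", "b", "c", "d", "e"));
        // Collections.shuffle uses the Fisher-Yates algorithm: with a fair
        // source of randomness, every permutation is equally likely.
        Collections.shuffle(lines, new Random());
        System.out.println(lines);
    }
}
```

This is only an option per in-memory chunk, of course; for the 500G-1T case in the question, the random-key MapReduce approach above is the practical route.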

Answered by Thomas Jungblut on Oct 26 '25
