Let's say I have a file of several billion lines, 500 GB to 1 TB in size. How can I produce a new file with the same lines, but with the lines randomly shuffled? The shuffle should be completely random, if that can be achieved.
Create a mapper that keys each line with a random GUID. The following Hadoop mapper illustrates the logic:
import java.io.IOException;
import java.util.UUID;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emit each line keyed by a random UUID; the shuffle/sort phase then orders lines by these random keys.
public class ShuffleMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        context.write(new Text(UUID.randomUUID().toString()), value);
    }
}
In the reducer, you simply collect the lines (the values). This can be done with a single reducer, or, if you run into resource issues (e.g. a local disk filling up), you can split the work across multiple reducers and then concatenate their output files from the command line. A sketch of such a reducer follows below.
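For completeness, here is a minimal reducer along those lines; the class name ShuffleReducer is illustrative. It drops the random UUID key and writes out only the original lines.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Discard the random UUID key and emit only the original lines.
public class ShuffleReducer extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text line : values) {
            context.write(NullWritable.get(), line);
        }
    }
}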
Note: This doesn't necessarily give an unbiased shuffle the way Fisher-Yates does, but this solution is easier to implement and quite fast.