
External shuffle: shuffling a large amount of data that does not fit in memory

I am looking for a way to shuffle a large amount of data which does not fit into memory (approx. 40GB).

I have around 30 million entries of variable length, stored in one large file. I know the starting and ending position of each entry in that file. I need to shuffle this data, which does not fit in RAM.

The only solution I have thought of is to shuffle an array containing the numbers from 1 to N, where N is the number of entries, with the Fisher-Yates algorithm, and then copy the entries into a new file in that order. Unfortunately, this solution involves a lot of seek operations and would therefore be very slow.
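For illustration, a minimal C++ sketch of that naive approach might look like the following (the Span struct, the function names, and the use of std::shuffle, which performs a Fisher-Yates shuffle, are all assumptions of the sketch, not existing code):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

struct Span { std::int64_t start, end; };  // byte range of one entry in the source file

void naiveShuffleCopy(const std::string& srcPath, const std::string& dstPath,
                      std::vector<Span> spans) {
    // Shuffle the entry order; std::shuffle is a Fisher-Yates shuffle.
    std::mt19937_64 rng(std::random_device{}());
    std::shuffle(spans.begin(), spans.end(), rng);

    std::ifstream src(srcPath, std::ios::binary);
    std::ofstream dst(dstPath, std::ios::binary);
    std::vector<char> buf;
    for (const Span& s : spans) {
        buf.resize(static_cast<std::size_t>(s.end - s.start));
        src.seekg(s.start);                                   // one random seek per entry: the slow part
        src.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        dst.write(buf.data(), static_cast<std::streamsize>(buf.size()));  // output itself is sequential
    }
}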

Is there a better solution for shuffling a large amount of data with a uniform distribution?

Edouard asked Jan 31 '13 14:01


4 Answers

First, get the shuffle issue out of the way. Do this by inventing a hash algorithm for your entries that produces random-like results, then do a normal external sort on the hash.

Now that you have transformed your shuffle into a sort, the problem becomes finding an efficient external sort algorithm that fits your budget and memory limits. That should be as easy as a Google search.
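As a rough C++ sketch of that idea: instead of a content hash, the code below tags each entry with a per-entry random 64-bit key (which plays the same role of a random-like sort key), writes (key, length, payload) records, and leaves the sort itself to any off-the-shelf external merge sort keyed on the leading 8 bytes. The record framing and all names are assumptions of the sketch:

#include <cstddef>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <utility>
#include <vector>

// Rewrite the source file as (key, length, payload) records. Any external merge
// sort that orders records by the leading 8-byte key then yields a uniformly
// shuffled order; stripping key and length afterwards gives the shuffled file.
void tagWithRandomKeys(const std::string& srcPath, const std::string& taggedPath,
                       const std::vector<std::pair<std::int64_t, std::int64_t>>& spans) {
    std::ifstream src(srcPath, std::ios::binary);
    std::ofstream out(taggedPath, std::ios::binary);
    std::mt19937_64 rng(std::random_device{}());
    std::vector<char> buf;
    for (const auto& span : spans) {
        std::uint64_t key = rng();                        // random key standing in for the "hash"
        std::uint64_t len = static_cast<std::uint64_t>(span.second - span.first);
        buf.resize(static_cast<std::size_t>(len));
        src.seekg(span.first);                            // sequential if spans are listed in file order
        src.read(buf.data(), static_cast<std::streamsize>(len));
        out.write(reinterpret_cast<const char*>(&key), sizeof key);
        out.write(reinterpret_cast<const char*>(&len), sizeof len);
        out.write(buf.data(), static_cast<std::streamsize>(len));
    }
}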

OldCurmudgeon answered Nov 11 '22 03:11


A simple approach is to pick a K such that 1/K of the data fits comfortably in memory. Perhaps K=4 for your data, assuming you've got 16GB of RAM. I'll assume your random number function has the form rnd(n), which generates a uniform random integer from 0 to n-1.

Then:

for i = 0 .. K-1
   Initialize your random number generator to the same known state on every pass.
   Read through the input data, generating a random number rnd(K) for each item as you go.
   Retain items in memory whenever rnd(K) == i.
   After you've read the input file, shuffle the retained data in memory.
   Write the shuffled retained items to the output file.

This is very easy to implement, will avoid a lot of seeking, and is clearly correct.
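A minimal C++ sketch of the loop above, assuming (purely for illustration) that entries are stored with an 8-byte length prefix; the readEntry/writeEntry helpers and the fixed seed are assumptions of the sketch, not part of the answer:

#include <algorithm>
#include <cstdint>
#include <fstream>
#include <random>
#include <string>
#include <vector>

// Assumed framing for the sketch: each entry is an 8-byte length followed by its bytes.
static bool readEntry(std::ifstream& in, std::string& e) {
    std::uint64_t len;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len)) return false;
    e.resize(len);
    return static_cast<bool>(in.read(&e[0], static_cast<std::streamsize>(len)));
}

static void writeEntry(std::ofstream& out, const std::string& e) {
    std::uint64_t len = e.size();
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(e.data(), static_cast<std::streamsize>(len));
}

void kPassShuffle(const std::string& srcPath, const std::string& dstPath, int K) {
    std::ofstream dst(dstPath, std::ios::binary);
    std::mt19937_64 shuffleRng(std::random_device{}());
    for (int i = 0; i < K; ++i) {
        std::mt19937_64 bucketRng(12345);                 // the same known state on every pass
        std::uniform_int_distribution<int> rnd(0, K - 1);
        std::ifstream src(srcPath, std::ios::binary);
        std::vector<std::string> kept;
        std::string e;
        while (readEntry(src, e))
            if (rnd(bucketRng) == i) kept.push_back(e);   // retain roughly 1/K of the entries
        std::shuffle(kept.begin(), kept.end(), shuffleRng);
        for (const auto& k : kept) writeEntry(dst, k);
    }
}

Because the bucket generator is re-seeded identically on every pass, each entry is retained in exactly one pass, so the output is a true permutation of the input.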

An alternative is to partition the input data into K files based on the random numbers, and then go through each of them, shuffling in memory and writing to disk. This reduces disk IO (each item is read twice and written twice, compared with the first approach, where each item is read K times and written once), but you need to be careful to buffer the IO to avoid a lot of seeking, it uses more intermediate disk space, and it is somewhat more difficult to implement. If you've got only 40GB of data (so K is small), the simple approach of multiple iterations through the input data is probably best.
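Here is a corresponding sketch of that intermediate-file variant, reusing the readEntry/writeEntry helpers (and the assumed length-prefixed framing) from the previous sketch; the bucket file names are placeholders:

#include <algorithm>
#include <fstream>
#include <random>
#include <string>
#include <vector>

void partitionShuffle(const std::string& srcPath, const std::string& dstPath, int K) {
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<int> rnd(0, K - 1);

    // Pass 1: scatter entries into K temporary files. The ofstreams are internally
    // buffered, which is what keeps the scattered writes from degenerating into seeks.
    std::vector<std::ofstream> buckets;
    for (int i = 0; i < K; ++i)
        buckets.emplace_back("bucket_" + std::to_string(i) + ".tmp", std::ios::binary);
    {
        std::ifstream src(srcPath, std::ios::binary);
        std::string e;
        while (readEntry(src, e)) writeEntry(buckets[rnd(rng)], e);
    }
    for (auto& b : buckets) b.close();

    // Pass 2: each bucket fits in memory, so shuffle it and append it to the output.
    std::ofstream dst(dstPath, std::ios::binary);
    for (int i = 0; i < K; ++i) {
        std::ifstream in("bucket_" + std::to_string(i) + ".tmp", std::ios::binary);
        std::vector<std::string> items;
        std::string e;
        while (readEntry(in, e)) items.push_back(e);
        std::shuffle(items.begin(), items.end(), rng);
        for (const auto& it : items) writeEntry(dst, it);
    }
}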

If you use 20ms as the time to read or write 1MB of data (and assume the in-memory shuffling cost is insignificant), the simple approach will take 40*1024*(K+1)*20ms, which is about 4,100 seconds, or roughly 1 hour 8 minutes, for K=4. The intermediate-file approach will take 40*1024*4*20ms, which is around 55 minutes, assuming you can minimize seeking. Note that an SSD is approximately 20 times faster for reads and writes (even ignoring seeking), so you should expect to perform this task in well under 10 minutes using an SSD. Numbers are from Latency Numbers Every Programmer Should Know.

Paul Hankin answered Nov 11 '22 03:11


I suggest keeping your general approach, but inverting the map before doing the actual copy. That way, you read sequentially and do scattered writes rather than the other way round.

A read must complete as soon as it is requested, before the program can continue. A write, however, can be left in a buffer, which increases the chance of accumulating more than one write to the same disk block before the write actually happens.
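A minimal sketch of this inverted-map copy, with a hypothetical Move record holding each entry's source span and its shuffled destination offset; a real implementation would pre-allocate the destination file to its final size before starting:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

struct Move {                              // hypothetical: one entry's source span and shuffled destination
    std::int64_t srcStart, srcEnd, dstStart;
};

void copyWithSequentialReads(const std::string& srcPath, const std::string& dstPath,
                             std::vector<Move> moves) {
    // Invert the map: process entries in *source* order so the reads are sequential.
    std::sort(moves.begin(), moves.end(),
              [](const Move& a, const Move& b) { return a.srcStart < b.srcStart; });

    std::ifstream src(srcPath, std::ios::binary);
    { std::ofstream create(dstPath, std::ios::binary); }    // ensure the output file exists
    std::fstream dst(dstPath, std::ios::in | std::ios::out | std::ios::binary);

    std::vector<char> buf;
    for (const Move& m : moves) {
        buf.resize(static_cast<std::size_t>(m.srcEnd - m.srcStart));
        src.seekg(m.srcStart);                              // entries are in source order: effectively no seek
        src.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        dst.seekp(m.dstStart);                              // scattered write, left to the stream/OS buffers
        dst.write(buf.data(), static_cast<std::streamsize>(buf.size()));
    }
}

The operating system's page cache does most of the coalescing here: nearby destination writes can be merged before they reach the disk, whereas every read stalls the loop until it completes.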

Patricia Shanahan answered Nov 11 '22 02:11


Premise

From what I understand, using the Fisher-Yates algorithm and the data you have about the positions of the entries, you should be able to compute a list of:

struct Entry {
    long long sourceStartIndex;      // where the entry starts in the source file
    long long sourceEndIndex;        // where the entry ends in the source file
    long long destinationStartIndex; // where the entry starts in the shuffled output
    long long destinationEndIndex;   // where the entry ends in the shuffled output
};

Problem

From this point onward, the naive solution is to seek to each entry in the source file, read it, then seek to the entry's new position in the destination file and write it there.

The problem with this approach is that it uses way too many seeks.

Solution

A better way is to reduce the number of seeks by using two large buffers, one for each file.

I recommend a small buffer for the source file (say 64MB) and a big one for the destination file (as big as the user can afford - say 2GB).

Initially, the destination buffer is mapped to the first 2GB of the destination file. Read the whole source file, in chunks of 64MB, into the source buffer. As you read, copy the appropriate entries into the destination buffer. When you reach the end of the source file, the output buffer contains all the data that belongs to this 2GB region; write it to the destination file.

Next, map the output buffer to the next 2GB of the destination file and repeat the procedure. Continue until you have written the whole output file.

Caution

Since the entries have arbitrary sizes, it is very likely that at the beginning and end of each buffer you will have suffixes and prefixes of entries that straddle the boundary, so you need to make sure you copy that data correctly!
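A minimal sketch of this windowed copy, using the Entry struct above; the 2GB/64MB buffer sizes follow the suggestion above, and the clipping at the end is what handles the straddling prefixes/suffixes mentioned in the caution. Function and constant names are assumptions of the sketch:

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

void windowedShuffleCopy(const std::string& srcPath, const std::string& dstPath,
                         std::vector<Entry> entries, std::int64_t totalSize) {
    const std::int64_t kWindow = 2LL * 1024 * 1024 * 1024;  // destination buffer (as big as you can afford)
    const std::int64_t kChunk  = 64LL * 1024 * 1024;        // source read chunk

    // Keep entries in source order so the chunked scan below can walk them in step.
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) { return a.sourceStartIndex < b.sourceStartIndex; });

    std::ofstream dst(dstPath, std::ios::binary);
    std::vector<char> window;
    std::vector<char> chunk(static_cast<std::size_t>(kChunk));

    for (std::int64_t winStart = 0; winStart < totalSize; winStart += kWindow) {
        const std::int64_t winEnd = std::min(winStart + kWindow, totalSize);
        window.assign(static_cast<std::size_t>(winEnd - winStart), 0);

        // One full sequential scan of the source per destination window.
        std::ifstream src(srcPath, std::ios::binary);
        std::size_t next = 0;                               // first entry not yet fully consumed
        for (std::int64_t chunkStart = 0; chunkStart < totalSize; chunkStart += kChunk) {
            const std::int64_t chunkEnd = std::min(chunkStart + kChunk, totalSize);
            src.read(chunk.data(), static_cast<std::streamsize>(chunkEnd - chunkStart));

            for (std::size_t i = next; i < entries.size() && entries[i].sourceStartIndex < chunkEnd; ++i) {
                const Entry& e = entries[i];
                if (e.sourceEndIndex <= chunkStart) { next = i + 1; continue; }
                // Clip the entry to this chunk, then clip its destination to this window;
                // this is what copies partial entries at the boundaries correctly.
                const std::int64_t sFrom = std::max(e.sourceStartIndex, chunkStart);
                const std::int64_t sTo   = std::min(e.sourceEndIndex, chunkEnd);
                const std::int64_t dFrom = std::max(e.destinationStartIndex + (sFrom - e.sourceStartIndex), winStart);
                const std::int64_t dTo   = std::min(e.destinationStartIndex + (sTo - e.sourceStartIndex), winEnd);
                if (dFrom >= dTo) continue;                 // nothing from this piece lands in the window
                const std::int64_t sCopy = e.sourceStartIndex + (dFrom - e.destinationStartIndex);
                std::copy(chunk.begin() + (sCopy - chunkStart),
                          chunk.begin() + (sCopy - chunkStart) + (dTo - dFrom),
                          window.begin() + (dFrom - winStart));
            }
        }
        dst.write(window.data(), static_cast<std::streamsize>(window.size()));
    }
}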

Estimated time costs

The execution time depends, essentially, on the size of the source file, the RAM available to the application, and the read speed of the HDD. Assuming a 40GB file, 2GB of RAM, and a 200MB/s HDD read speed, the program will need to read 800GB of data (40GB * (40GB / 2GB)). Assuming the HDD is not highly fragmented, the time spent on seeks will be negligible. This means the reads alone will take roughly an hour (800GB at 200MB/s is about 4,100 seconds)! But if, luckily, the user has 8GB of RAM available for your application, the time may decrease to only 15 to 20 minutes.

I hope this will be enough for you, as I don't see any faster way.

Spatarel answered Nov 11 '22 03:11