
How to sort 100GB worth of strings

Tags: java, sorting

Given a hard drive with 120GB, 100GB of which are filled with strings of length 256, and 2GB of RAM, how do I sort those strings in Java most efficiently? How long will it take?

asked Apr 02 '10 by Christian

People also ask

How do I sort a large amount of data?

When dealing with massive data sets, we usually use Hadoop, a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. A common approach to sorting big data is to rely on the shuffle-and-sort phase of MapReduce on Hadoop.
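
As a rough illustration of that idea, here is a minimal sketch of a MapReduce sort job using Hadoop's mapreduce API. The class names are illustrative; the mapper emits each input line as a key, and the framework's shuffle-and-sort phase does the actual ordering. Note that a single reducer (or a TotalOrderPartitioner) is needed for a globally sorted result:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Identity-style mapper: emit each line as a key so the framework's
    // shuffle-and-sort phase orders the records for us.
    class SortMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }

    // Identity reducer: keys arrive in sorted order; just write them out.
    class SortReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }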

How do you sort an array of strings?

To sort a String array in Java, compare pairs of elements with compareTo and swap them when the result is greater than 0; in practice, the standard library's Arrays.sort does this for you, as shown below.
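
A minimal example using the standard library (the array contents are illustrative):

    import java.util.Arrays;

    public class SortStrings {
        public static void main(String[] args) {
            String[] words = { "pear", "apple", "banana" };
            // Arrays.sort uses the natural (lexicographic) ordering of String,
            // i.e. String.compareTo, under the hood.
            Arrays.sort(words);
            System.out.println(Arrays.toString(words)); // [apple, banana, pear]
        }
    }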


2 Answers

A1: You probably want to implement some form of merge-sort.

A2: Longer than it would take if you had 256GB of RAM on your machine.

Edit: stung by criticism, I quote from Wikipedia's article on merge sort:

Merge sort is so inherently sequential that it is practical to run it using slow tape drives as input and output devices. It requires very little memory, and the memory required does not depend on the number of data elements.

For the same reason it is also useful for sorting data on disk that is too large to fit entirely into primary memory. On tape drives that can run both backwards and forwards, merge passes can be run in both directions, avoiding rewind time.
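
To make the constant-memory point concrete: merging two sorted runs only ever holds one line from each run in memory, regardless of how large the runs are. A rough sketch (file names are illustrative):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.io.PrintWriter;

    public class TwoWayMerge {
        public static void main(String[] args) throws IOException {
            try (BufferedReader a = new BufferedReader(new FileReader("run1.txt"));
                 BufferedReader b = new BufferedReader(new FileReader("run2.txt"));
                 PrintWriter out = new PrintWriter("merged.txt")) {
                String x = a.readLine(), y = b.readLine();
                // Only the two current lines are held in memory at any time.
                while (x != null && y != null) {
                    if (x.compareTo(y) <= 0) { out.println(x); x = a.readLine(); }
                    else                     { out.println(y); y = b.readLine(); }
                }
                // Drain whichever input still has data.
                while (x != null) { out.println(x); x = a.readLine(); }
                while (y != null) { out.println(y); y = b.readLine(); }
            }
        }
    }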

answered Oct 01 '22 by High Performance Mark


Here is how I'd do it:

Phase 1 is to split the 100Gb into 50 partitions of 2Gb, read each of the 50 partitions into memory, sort using quicksort, and write out. You want the sorted partitions at the top end of the disc.
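A sketch of phase 1, assuming one string per line and a chunk budget small enough that a partition fits in the heap (in practice well under 2Gb, since Java Strings carry substantial per-object overhead). Collections.sort stands in for the quicksort here, and the file names are illustrative:

    import java.io.*;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class RunGenerator {
        // Read the input in fixed-size chunks, sort each chunk in memory, and
        // write it out as a sorted "run" file. 100Gb of input in 2Gb chunks
        // yields about 50 runs.
        public static int makeRuns(String inputFile, long maxChunkBytes) throws IOException {
            int runCount = 0;
            try (BufferedReader in = new BufferedReader(new FileReader(inputFile))) {
                List<String> chunk = new ArrayList<>();
                long bytes = 0;
                String line;
                while ((line = in.readLine()) != null) {
                    chunk.add(line);
                    bytes += line.length(); // rough byte estimate; ignores object overhead
                    if (bytes >= maxChunkBytes) {
                        writeRun(chunk, runCount++);
                        chunk.clear();
                        bytes = 0;
                    }
                }
                if (!chunk.isEmpty()) writeRun(chunk, runCount++);
            }
            return runCount;
        }

        private static void writeRun(List<String> chunk, int runId) throws IOException {
            Collections.sort(chunk); // in-memory sort of one partition
            try (PrintWriter out = new PrintWriter(new BufferedWriter(
                    new FileWriter("run-" + runId + ".txt")))) {
                for (String s : chunk) out.println(s);
            }
        }
    }
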

Phase 2 is to then merge the 50 sorted partitions. This is the tricky bit because you don't have enough space on the disc to store the partitions AND the final sorted output. So ...

  1. Do a 50-way merge (sketched after this list) to fill the first 20Gb at the bottom end of the disc.

  2. Slide the remaining data in the 50 partitions to the top to make another 20Gb of free space contiguous with the end of the first 20Gb.

  3. Repeat steps 1. and 2. until completed.
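
Here is a rough sketch of the 50-way merge from step 1, using a priority queue keyed on each partition's next string. It holds one buffered reader and one line per partition; the sliding / compaction of step 2 is not shown:

    import java.io.*;
    import java.util.PriorityQueue;

    public class KWayMerge {
        // One entry per run: the run's next line plus its reader.
        private static final class Head implements Comparable<Head> {
            final String line;
            final BufferedReader reader;
            Head(String line, BufferedReader reader) { this.line = line; this.reader = reader; }
            public int compareTo(Head o) { return line.compareTo(o.line); }
        }

        // Merge the given sorted run files into a single sorted output stream.
        // Memory use is one buffered reader and one line per run, not the data.
        public static void merge(String[] runFiles, Writer output) throws IOException {
            PriorityQueue<Head> heap = new PriorityQueue<>();
            for (String f : runFiles) {
                BufferedReader r = new BufferedReader(new FileReader(f));
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
                else r.close();
            }
            PrintWriter out = new PrintWriter(new BufferedWriter(output));
            while (!heap.isEmpty()) {
                Head h = heap.poll();
                out.println(h.line);              // smallest remaining string
                String next = h.reader.readLine();
                if (next != null) heap.add(new Head(next, h.reader));
                else h.reader.close();
            }
            out.flush();
        }
    }

With 50 partitions, the heap holds just 50 strings at a time; the real cost is the disc IO, which is why large read and write buffers matter.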

This does a lot of disc IO, but you can make use of your 2Gb of memory for buffering in the copying and merging steps, maximizing throughput by minimizing the number of disc seeks and doing large sequential transfers.

EDIT - @meriton has proposed a clever way to reduce copying. Instead of sliding, he suggests that the partitions be sorted into reverse order and read backwards in the merge phase. This would allow the algorithm to release the disc space used by partitions (phase 2, step 2) by simply truncating the partition files.
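
The truncation part can be done with RandomAccessFile.setLength. A minimal sketch, assuming the partition was written in reverse order and its tail has just been consumed by the merge (the method and file names are illustrative):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class PartitionTrim {
        // After consuming the last `consumedBytes` of a partition file, give
        // the space back to the filesystem by truncating the file.
        public static void releaseTail(String partitionFile, long consumedBytes)
                throws IOException {
            try (RandomAccessFile f = new RandomAccessFile(partitionFile, "rw")) {
                long newLength = Math.max(0, f.length() - consumedBytes);
                f.setLength(newLength); // frees disc space without rewriting data
            }
        }
    }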

The potential downsides of meriton's approach are increased disc fragmentation, and loss of performance due to reading the partitions backwards. (On the latter point, reading a file backwards on Linux / UNIX requires more syscalls, and the FS implementation may not be able to do "read-ahead" in the reverse direction.)

Finally, I'd like to point out that any theoretical predictions of the time taken by this algorithm (and others) are largely guesswork. The behaviour of these algorithms on a real JVM + real OS + real discs is just too complex for "back of the envelope" calculations to give reliable answers. A proper treatment would require actual implementation, tuning and benchmarking.

answered Oct 01 '22 by Stephen C