Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

sorting 50 000 000 numbers

Suppose, that we need to sort 50 000 000 numbers. suppose, that the numbers is stored in a file. What is the most efficient algorithm for solving this problem? Parallel algorithm for sorting...

How to do it? Maybe useful link )

I can't use standard algorithm

Therefore i ask you about methods and algorithms :)

Ok.. I read about parallel mergesort... But it's not clear for me.

solution, the first version

code is located here

like image 808
mr. Vachovsky Avatar asked Nov 27 '10 12:11

mr. Vachovsky


People also ask

How do you sort large numbers of data?

For sorting a very large file , we can use external sorting technique. External sorting is an algorithm that can handle massive amounts of data. It is required when the data to be sorted does not fit into the main memory and instead they reside in the slower external memory . It uses a hybrid sort-merge strategy.

Which sorting algorithm is best for large data?

Quick sort is the better suited for large data sets. [8]It is the fastest and efficient algorithm for large sets of data. But it is inefficient if the elements in the list are already sorted which results in the worst case time complexity of O(n2).

What is the best way to sort 1 million integers?

Integer Sorting with limited range is achieved efficiently with Bucket Sorting. Bucket sort, counting sort, radix sort, and van Emde Boas tree sorting all work best when the key size is small; for large enough keys, they become slower than comparison sorting algorithm.

How long does it take to sort 1000 elements?

Selection sort takes one millisecond to sort 1000 items (worst-case time) on a particular computer. Estimate the amount of time it would take to sort 100,000 items.


2 Answers

50 million is not particularly large. I would just read them into memory. Sort them and write them out. It should take just a few seconds. How fast do you need it be? How compilcated do you need it to be?

On my old labtop it took 28 seconds. If I had more processors, it might be a little faster but much of the time is spent reading and writing the file (15 seconds) which wouldn't be any faster.

One of the critical factors is the size of your cache. The comparison itself is very cheap provided the data is in cache. As the L3 cache is shared, one thread is all you need to make full use of it.

public static void main(String...args) throws IOException {
    generateFile();

    long start = System.currentTimeMillis();
    int[] nums = readFile("numbers.bin");
    Arrays.sort(nums);
    writeFile("numbers2.bin", nums);
    long time = System.currentTimeMillis() - start;
    System.out.println("Took "+time+" secs to sort "+nums.length+" numbers.");
}

private static void generateFile() throws IOException {
    Random rand = new Random();
    int[] ints = new int[50*1000*1000];
    for(int i= 0;i<ints.length;i++)
        ints[i] = rand.nextInt();
    writeFile("numbers.bin", ints);
}

private static int[] readFile(String filename) throws IOException {
    DataInputStream dis = new DataInputStream(new BufferedInputStream(new FileInputStream(filename), 64*1024));
    int len = dis.readInt();
    int[] ints = new int[len];
    for(int i=0;i<len;i++)
        ints[i] = dis.readInt();
    return ints;
}

private static void writeFile(String name, int[] numbers) throws IOException {
    DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(new FileOutputStream(name), 64*1024));
    dos.writeInt(numbers.length);
    for (int number : numbers)
        dos.writeInt(number);
    dos.close();
}
like image 85
Peter Lawrey Avatar answered Oct 05 '22 22:10

Peter Lawrey


From top of my head, merge sort seems to be the best option when it comes to parallelisation and distribution, as it uses divide-and-conquer approach. For more information, google for "parallel merge sort" and "distributed merge sort".

For single-machine, multiple cores example, see see Correctly multithreaded quicksort or mergesort algo in Java?. If you can use Java 7 fork/join then see: "Java 7: more concurrency" and "Parallelism with Fork/Join in Java 7".

For distributing it over many machines, see Hadoop, it has a distributed merge sort implementation: see MergeSort and MergeSorter. Also of interest: Hadoop Sorts a Petabyte in 16.25 Hours and a Terabyte in 62 Seconds

like image 29
Neeme Praks Avatar answered Oct 05 '22 22:10

Neeme Praks