Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Sorting a huge file in Java

I have a file, which consists of a one row:

 1 , 1 2 , 1 3 6 , 4 ,...

In this representation, spaces separate the integers and commas. This string is so huge that I can't read it with RandomAccessFile.readLine() (almost 4 Gb needed). So that I created a buffer, which can contain 10 integers. My task is to sort all integers in the string.

Could you, please, help?

EDIT

@Oscar Reyes

I need to write some sequences of integers to a file and then to read from it. Actually I don't know, how to do it. I'm a newbie. So I decided to use chars to write integers, delimiters between integers are ",", and delimeters between sequences are "\n\r" which. So that I created a monster that reads it:

public BinaryRow getFilledBuffer(String filePath, long offset) throws IOException{
    mainFile = new RandomAccessFile(filePath, "r");

    if (mainFile.length() == 0){
        return new BinaryRow();
    }

    StringBuilder str = new StringBuilder();

    mainFile.seek(mainFile.length()-4); //that is "\n" symbol
    char chN = mainFile.readChar();

    mainFile.seek(offset);
    int i = 0;
    char nextChar = mainFile.readChar();
    while (i < 11 && nextChar != chN){
        str.append(nextChar);
        if (nextChar == ','){
            i++;
            if (i == 10){
                break;
            }
        }
        nextChar = mainFile.readChar();
    }

    if (nextChar == chN){
        position = -1;
    }else{
        position = mainFile.getFilePointer();
    }

    BinaryRow br = new BinaryRow();

    StringBuilder temp = new StringBuilder();

    for (int j = 0; j < str.length(); j++){
        if ((str.charAt(j) != ',')){
            temp.append(str.charAt(j));
            if (j == str.length() - 1){
                br.add(Integer.parseInt(temp.toString()));
            }   
        }else{
            br.add(Integer.parseInt(temp.toString()));
            temp.delete(0, temp.length());
        }
    }


    mainFile.close();
    return br;

}

If you could advise how to do it, please do it =)

like image 597
Dmitry Avatar asked Mar 04 '10 21:03

Dmitry


2 Answers

This is exactly the origin QuickSort back then there was not enough RAM to sort in memory so they procedure is to store partial results in disk.

So what you can do is:

  1. Pick a pivot.
  2. Read sequentially your file and store data lower than pivot in temp_file_1 and data bigger or equal to the pivot in temp_file_2
  3. Repeat the procedure in temp_file_1 and append the result to result_file
  4. Repeat the procedure for temp_file_2 and append the result to result_file

When parts are small enough ( like 2 just direct swap them Enough to be sorted in memory )

This way you'll be able to sort in chunks and store the partial results in temp files and you'll have a final file with the result sorted.

EDIT I told you a quick sort was possible.

It seems like you would need some extra space for the temp files after all.

Here's what I did.

I create a 40 mb file with numbers separated by commas.

I name it input:

input http://img200.imageshack.us/img200/5129/capturadepantalla201003t.png

Input is 40mb

During the sort, the tmp files with the buckets of "greater than", "lower than" values are created and when the sort is finished, the values are sent to a file called ( guess what ) output

processing http://img200.imageshack.us/img200/1672/capturadepantalla201003y.png

Temp files are created with the partial results

Finally all the tmp files are deleted and the result is kept in the file "output" with the correct sorted sequence of numbers:

output http://img203.imageshack.us/img203/5950/capturadepantalla201003w.png

Finally the file "output" is created, notice it is 40 mb too

Here's the full program.

import java.io.*;
import java.util.*;

public class FileQuickSort {

    static final int MAX_SIZE = 1024*1024*16; // 16 megabytes in this sample, the more memory your program has, less disk writing will be used. 
    public static void main( String [] args ) throws IOException {
        fileQuickSort( new File("input"), new File("output"));
        System.out.println();
    }

    //
    static void fileQuickSort( File inputFile, File outputFile ) throws IOException {
        Scanner scanner = new Scanner( new BufferedInputStream( new FileInputStream( inputFile ), MAX_SIZE));
        scanner.useDelimiter(",");

        if( inputFile.length() > MAX_SIZE && scanner.hasNextInt()) {
            System.out.print("-");

            // put them in two buckets... 
            File lowerFile = File.createTempFile("quicksort-","-lower.tmp",new File("."));
            File greaterFile = File.createTempFile("quicksort-","-greater.tmp", new File("."));
            PrintStream  lower   = createPrintStream(lowerFile);
            PrintStream greater  = createPrintStream(greaterFile);
            PrintStream target = null;
            int pivot = scanner.nextInt();

            // Read the file and put the values greater than in a file 
            // and the values lower than in other 
            while( scanner.hasNextInt() ){
                int current = scanner.nextInt();

                if( current < pivot ){
                    target = lower;
                } else {
                    target = greater;
                }
                target.printf("%d,",current);
            }
            // avoid dropping the pivot
            greater.printf("%d,",pivot);
            // close the stream before reading them again
            scanner.close();
            lower.close();
            greater.close();
            // sort each part
            fileQuickSort( lowerFile , outputFile );
            lowerFile.delete();
            fileQuickSort( greaterFile   , outputFile);
            greaterFile.delete();

            // And you're done.
        } else {

            // Else , if you have enough RAM to process it
            // 
            System.out.print(".");
            List<Integer> smallFileIntegers = new ArrayList<Integer>();
            // Read it
            while( scanner.hasNextInt() ){
                smallFileIntegers.add( scanner.nextInt() );
            }
            scanner.close();

            // Sort them in memory 
            Collections.sort( smallFileIntegers );

            PrintStream out = createPrintStream( outputFile);
            for( int i : smallFileIntegers ) {
                out.printf("%d,",i);
            }
            out.close();
            // And your're done
        }
    }
    private static PrintStream createPrintStream( File file ) throws IOException {
        boolean append = true;
        return new PrintStream(  new BufferedOutputStream( new FileOutputStream( file, append )));
    }
}

The format of the files is number,number,number,number

Your current format is: n u m b e r , n u m b , b e r

To fix that you just have to read it all and skip the blanks.

Add another question for that.

like image 69
OscarRyz Avatar answered Oct 04 '22 23:10

OscarRyz


Read it to memory in chunks (100 MB each?), one chunk at a time, sort it and save to disk.

Then open all the ordered chunks, read the first element of each, and append the lowest to the output. Then read the next element of the chunk you just read from and repeat.

When merging you can keep an array of the last int read from each chunk and just iterate over it to get the lowest. Then you substitute the value you just used with the next element in the chunk it was taken from.

example with chunks [1, 5, 16] [2, 9, 14] [3, 8, 10]
array [(1), 2, 3], lowest 1 --> to output
      [5, (2), 3], lowest 2 --> to output
      [5, 9, (3)], lowest 3 -->
      [(5), 9, 8],        5
      [16, 9, (8)],       8
      [16, (9), 10],      9 
...
like image 21
Utaal Avatar answered Oct 04 '22 21:10

Utaal