
Aggregate key-value lines in a file by keys in Java

I have a huge file of ~800M rows (60 GB). Rows can be duplicated, and each row consists of an id and a value. For example:

id1   valueA
id1   valueB 
id2   valueA 
id3   valueC
id3   valueA
id3   valueC

Note: the ids are not ordered and grouped as in the example.

I want to aggregate the rows by key, like this:

id1   valueA,valueB
id2   valueA
id3   valueC,valueA

There are 5000 possible values.

The file doesn't fit in memory, so I can't use simple Java collections. Also, most of the ids have a single value (like id2 in the example), and those lines should be written directly to the output file.

For this reason, my first solution was to iterate over the file twice:

  • In the first iteration I store two structures containing only ids, no values:
    • single-value ids (S1)
    • multiple-value ids (S2)
  • In the second iteration, after discarding the single-value ids (S1) from memory, I can write the single-value id-value pairs directly to the output file, checking that their ids are not in the multiple-value set (S2) (sketched below)
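
Roughly, what I am trying to do looks like this (just a sketch, not my actual code; fileIn and fileOut are placeholder paths, and lines are assumed to be whitespace-separated):

Path fileIn = Paths.get("input.txt");   // placeholder input path
Path fileOut = Paths.get("output.txt"); // placeholder output path

// First iteration: classify ids into single-value (S1) and multiple-value (S2) sets
Set<String> s1 = new HashSet<>();
Set<String> s2 = new HashSet<>();
try (BufferedReader reader = Files.newBufferedReader(fileIn, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        String id = line.split("\\s+")[0];
        if (!s2.contains(id) && !s1.add(id)) {
            // id was already in S1: move it to S2
            s1.remove(id);
            s2.add(id);
        }
    }
}

// Second iteration: S1 is no longer needed; write single-value lines straight out
s1 = null;
try (BufferedReader reader = Files.newBufferedReader(fileIn, StandardCharsets.UTF_8);
     BufferedWriter writer = Files.newBufferedWriter(fileOut, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        if (!s2.contains(line.split("\\s+")[0])) {
            writer.write(line);
            writer.newLine();
        }
        // lines whose id is in S2 still need to be aggregated separately
    }
}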

The problem is that I cannot even finish the first iteration due to memory limits.

I know that the problem could be tackled in several ways (key-value store, MapReduce, external sort).

My question is: which method would be the most suitable and fastest to implement? It is a one-off process and I prefer to stick to Java (no external sort tools).

Asked by miccia4 on Jan 21 '16


2 Answers

As already said (that was quick!), merge sort is one approach. Concretely: sort locally by id, say, every 1 million lines, and save each locally sorted chunk to a smaller file. Then repeatedly merge the smaller sorted files in pairs into one big sorted file. You can do the aggregation while merging the smaller files.

The intuition is that when you merge two sorted lists, you keep two pointers, one per list, and emit lines in sorted order as you go. You never need to load the complete lists, so you can stream in the big files and stream out the merged result immediately.

Here is sample code to sort in-memory and output to a file:

private void sortAndSave(List<String> lines, Path fileOut) throws IOException {
    Collections.sort(lines, comparator);
    Files.write(fileOut, lines);
}
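
The comparator field used above is not shown in the answer; a minimal sketch, assuming lines should be ordered by the id that precedes the first whitespace, could be:

// Assumption: compare lines by their id prefix (the text before the first whitespace)
private final Comparator<String> comparator =
        Comparator.comparing(line -> line.split("\\s+")[0]);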

Here is sample code to sort locally and save the results into smaller files:

// Sort once we collect 1000000 lines
final int cutoff = 1000000;
final List<String> lines = new ArrayList<>();
int fileCount = 0;
try (BufferedReader reader = Files.newBufferedReader(fileIn, charset)) {
    String line = reader.readLine();
    while (line != null) {
        lines.add(line);
        if (lines.size() > cutoff) {
            fileCount++;
            sortAndSave(lines, Paths.get("fileOut" + fileCount));
            lines.clear();
        }
        line = reader.readLine();
    }
    if (lines.size() > 0) {
        fileCount++;
        sortAndSave(lines, Paths.get("fileOut" + fileCount));
    }
}

Here is sample code to merge two sorted files:

try (BufferedReader reader1 = Files.newBufferedReader(file1, charset);
     BufferedReader reader2 = Files.newBufferedReader(file2, charset);
     BufferedWriter writer = Files.newBufferedWriter(fileOut, charset)) {
    String line1 = reader1.readLine();
    String line2 = reader2.readLine();
    while (line1 != null && line2 != null) {
        if (comparator.compare(line1, line2) <= 0) {
            writer.write(line1);
            writer.newLine();
            line1 = reader1.readLine();
        } else {
            writer.write(line2);
            writer.newLine();
            line2 = reader2.readLine();
        }
    }
    // Drain whichever file still has lines left
    while (line1 != null) {
        writer.write(line1);
        writer.newLine();
        line1 = reader1.readLine();
    }
    while (line2 != null) {
        writer.write(line2);
        writer.newLine();
        line2 = reader2.readLine();
    }
}
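
The aggregation itself is not shown above; a minimal sketch of a final pass over the fully merged, sorted file could look like this (mergedIn and aggregatedOut are placeholder paths, and fields are assumed to be whitespace-separated):

// Collapse consecutive lines that share the same id into "id<TAB>value1,value2,..."
try (BufferedReader reader = Files.newBufferedReader(mergedIn, charset);
     BufferedWriter writer = Files.newBufferedWriter(aggregatedOut, charset)) {
    String currentId = null;
    // LinkedHashSet keeps the first-seen order and drops duplicate values
    Set<String> values = new LinkedHashSet<>();
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\\s+", 2);
        if (currentId != null && !currentId.equals(parts[0])) {
            writer.write(currentId + "\t" + String.join(",", values));
            writer.newLine();
            values.clear();
        }
        currentId = parts[0];
        values.add(parts[1]);
    }
    if (currentId != null) {
        writer.write(currentId + "\t" + String.join(",", values));
        writer.newLine();
    }
}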
Answered by neurite


When dealing with such a large amount of data, we need to think outside the box and buffer the entire thing.

First: how does this already work elsewhere?

Let's say I have a 4 GB video and I'm trying to load it into my video player. My player would basically need to perform 2 main operations:

buffering - 'splitting' the video into chunks and reading one chunk at a time (the buffer)

streaming - displaying the result (video) to my software (player)

Why? Because it would be impossible to load everything into memory at once (and we don't even really need to: at any given moment the user only watches a portion of the video from the buffer, which is a portion of the entire file).

Second: how can this help us?

We can do the same thing for large files (a sketch of the grouping step follows the list below):

  1. split the main file into smaller files (each file contains X rows, where X is the 'buffer' size)
  2. load each small file into Java and group it by id
  3. save the grouped result to a new file
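
A minimal sketch of steps 2 and 3 for a single chunk file (assuming whitespace-separated lines and placeholder paths chunkIn and groupedOut):

// Group one chunk file by id and write "id<TAB>value1,value2,..." lines
Map<String, Set<String>> groups = new LinkedHashMap<>();
try (BufferedReader reader = Files.newBufferedReader(chunkIn, StandardCharsets.UTF_8)) {
    String line;
    while ((line = reader.readLine()) != null) {
        String[] parts = line.split("\\s+", 2);
        groups.computeIfAbsent(parts[0], k -> new LinkedHashSet<>()).add(parts[1]);
    }
}
try (BufferedWriter writer = Files.newBufferedWriter(groupedOut, StandardCharsets.UTF_8)) {
    for (Map.Entry<String, Set<String>> entry : groups.entrySet()) {
        writer.write(entry.getKey() + "\t" + String.join(",", entry.getValue()));
        writer.newLine();
    }
}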

After this process we have many small files containing information like this:

id1   valueA,valueB
id2   valueA
id3   valueC,valueA

so each grouped file contains fewer rows than the original small file it was derived from.

  1. we can now merge them back, load the result into Java, and re-group everything (see the sketch after this list)
  2. if the process fails (the result is still too big), we can merge the small grouped files into several intermediate grouped files and repeat the process
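
A minimal sketch of the re-grouping step, assuming the grouped files use the id<TAB>value1,value2 layout from the sketch above and that the merged map fits in memory (groupedFiles and finalOut are placeholders):

// Merge several already-grouped files: union the value lists per id
Map<String, Set<String>> merged = new LinkedHashMap<>();
for (Path groupedFile : groupedFiles) {
    try (BufferedReader reader = Files.newBufferedReader(groupedFile, StandardCharsets.UTF_8)) {
        String line;
        while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            Set<String> values = merged.computeIfAbsent(parts[0], k -> new LinkedHashSet<>());
            Collections.addAll(values, parts[1].split(","));
        }
    }
}
try (BufferedWriter writer = Files.newBufferedWriter(finalOut, StandardCharsets.UTF_8)) {
    for (Map.Entry<String, Set<String>> entry : merged.entrySet()) {
        writer.write(entry.getKey() + "\t" + String.join(",", entry.getValue()));
        writer.newLine();
    }
}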
Answered by ymz