I have a huge file of ~800M rows (60 GB). Rows can be duplicates, and each row consists of an id and a value. For example:
id1 valueA
id1 valueB
id2 valueA
id3 valueC
id3 valueA
id3 valueC
Note: the ids are not ordered and grouped as in the example.
I want to aggregate the rows by key, like this:
id1 valueA,valueB
id2 valueA
id3 valueC,valueA
There are 5000 possible values.
The file doesn't fit in memory, so I can't use simple Java collections. Also, most of the ids appear on a single line only (like id2 in the example), and those should be written directly to the output file.
For this reason, my first solution was to iterate over the file twice.
The problem is that I cannot even finish the first iteration because of memory limits.
I know the problem could be tackled in several ways (key-value store, MapReduce, external sort).
My question is: which method would be the best fit and the fastest to implement? It is a one-off process and I would prefer to use Java (not an external sort tool).
As already said (that's quick!), merge sort is one approach. Concretely: sort locally by id, say every 1 million lines, save each locally sorted batch into a smaller file, and then repeatedly merge the smaller sorted files in pairs into one big sorted file. You can do the aggregation while merging the smaller files.
The intuition is that when you merge two sorted lists, you maintain two pointers, one per list, and sort as you go. You don't need to load the complete lists. This allows you to buffer in the big files and buffer out the merged results immediately.
Here is sample code to sort in-memory and output to a file:
private void sortAndSave(List<String> lines, Path fileOut) throws IOException {
    Collections.sort(lines, comparator);
    Files.write(fileOut, lines);
}
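The comparator is not shown above; here is a minimal sketch of one possibility, assuming each line has the form "id value", so that comparing the prefix before the first space keeps all lines for a given id adjacent after sorting:
// Assumption: lines look like "id value"; ordering by the id prefix is enough
// to group lines with the same id together.
private static final Comparator<String> comparator =
        Comparator.comparing((String line) -> line.substring(0, line.indexOf(' ')));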
Here is sample code to sort locally and save the results into smaller files:
// Sort once we collect 1000000 lines
final int cutoff = 1000000;
final List<String> lines = new ArrayList<>();
int fileCount = 0;
try (BufferedReader reader = Files.newBufferedReader(fileIn, charset)) {
    String line = reader.readLine();
    while (line != null) {
        lines.add(line);
        if (lines.size() > cutoff) {
            fileCount++;
            sortAndSave(lines, Paths.get("fileOut" + fileCount));
            lines.clear();
        }
        line = reader.readLine();
    }
    // Flush the final, partially filled batch
    if (lines.size() > 0) {
        fileCount++;
        sortAndSave(lines, Paths.get("fileOut" + fileCount));
    }
}
Here is sample code to merge sort 2 files:
try (BufferedReader reader1 = Files.newBufferedReader(file1, charset);
     BufferedReader reader2 = Files.newBufferedReader(file2, charset);
     BufferedWriter writer = Files.newBufferedWriter(fileOut, charset)) {
    String line1 = reader1.readLine();
    String line2 = reader2.readLine();
    while (line1 != null && line2 != null) {
        // Write the smaller line first to keep the output sorted
        if (comparator.compare(line1, line2) < 0) {
            writer.write(line1);
            writer.newLine();
            line1 = reader1.readLine();
        } else {
            writer.write(line2);
            writer.newLine();
            line2 = reader2.readLine();
        }
    }
    // Drain whichever file still has lines left
    while (line1 != null) {
        writer.write(line1);
        writer.newLine();
        line1 = reader1.readLine();
    }
    while (line2 != null) {
        writer.write(line2);
        writer.newLine();
        line2 = reader2.readLine();
    }
}
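The aggregation itself is not shown; one simple option (a sketch, not taken from the answer) is to do it as a final pass over the fully sorted file, or fold the same logic into the last merge: lines for the same id are now adjacent, so you can collect the values of consecutive lines and write one line per id. Here sortedFile, aggregatedFile and charset are assumed variables:
try (BufferedReader reader = Files.newBufferedReader(sortedFile, charset);
     BufferedWriter writer = Files.newBufferedWriter(aggregatedFile, charset)) {
    String currentId = null;
    // With only ~5000 possible values, the per-id set stays small
    Set<String> values = new LinkedHashSet<>();
    String line;
    while ((line = reader.readLine()) != null) {
        int space = line.indexOf(' ');
        String id = line.substring(0, space);
        String value = line.substring(space + 1);
        if (!id.equals(currentId)) {
            // New id: flush the previous group first
            if (currentId != null) {
                writer.write(currentId + " " + String.join(",", values));
                writer.newLine();
            }
            currentId = id;
            values.clear();
        }
        values.add(value); // a Set drops duplicates such as the second "id3 valueC"
    }
    // Flush the last group
    if (currentId != null) {
        writer.write(currentId + " " + String.join(",", values));
        writer.newLine();
    }
}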
When dealing with such a large amount of data, we need to think outside the box and buffer the entire thing, chunk by chunk.
Let's say I have 4 GB of video and I'm trying to load it into my video player. The player would basically need to perform two main operations:
buffering - 'splitting' the video into chunks and reading one chunk at a time (the buffer)
streaming - displaying the result (the video) in my software (the player)
Why? Because it would be impossible to load everything into memory at once (and we don't even really need to: at any given moment the user watches only a portion of the video from the buffer, which is a portion of the entire file).
We can do the same thing for large files:
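A minimal sketch of that chunked grouping, assuming lines look like "id value"; the chunk size, the chunk file names, and the writeGroupedChunk helper are illustration only, not part of the original answer:
// Read the big file in fixed-size chunks, group each chunk by id in memory,
// and write every grouped chunk to its own small file.
final int chunkSize = 1_000_000; // assumption: ~1M lines fit comfortably in memory
int chunkCount = 0;
try (BufferedReader reader = Files.newBufferedReader(fileIn, charset)) {
    Map<String, Set<String>> groups = new HashMap<>();
    String line;
    int count = 0;
    while ((line = reader.readLine()) != null) {
        int space = line.indexOf(' ');
        groups.computeIfAbsent(line.substring(0, space), k -> new LinkedHashSet<>())
              .add(line.substring(space + 1));
        if (++count >= chunkSize) {
            writeGroupedChunk(groups, Paths.get("chunk" + (++chunkCount)));
            groups.clear();
            count = 0;
        }
    }
    if (!groups.isEmpty()) {
        writeGroupedChunk(groups, Paths.get("chunk" + (++chunkCount)));
    }
}

// Hypothetical helper: writes one "id value1,value2,..." line per id.
private void writeGroupedChunk(Map<String, Set<String>> groups, Path out) throws IOException {
    try (BufferedWriter writer = Files.newBufferedWriter(out, charset)) {
        for (Map.Entry<String, Set<String>> e : groups.entrySet()) {
            writer.write(e.getKey() + " " + String.join(",", e.getValue()));
            writer.newLine();
        }
    }
}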
After this process we have many small files that contain information like this:
id1 valueA,valueB
id2 valueA
id3 valueC,valueA
So each grouped file contains fewer rows than the original chunk it was derived from.