I recently needed to sort a one line file (integers separated by ",") into smaller chunks with memory restriction and efficiency in mind. I'm currently following this logic:
File file = new File("bigfile.txt");
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
int BUFFER_SIZE = 10; // can and should be bigger
byte[] bytes = new byte[BUFFER_SIZE];
while ((bis.read(bytes)) != -1) {
// convert bytes to string
// split bytes to String[]
// save the last number if was cut in the middle and save it for the next round of reading and remove it from the current String[]
// fix cut number if necessary and put it in the String[]
// sort the String[]
// write the String[] into a file
// call Garbage collector to prevent memory leak?
}
bis.close();
Assuming I'm restricted to 5MB of memory and have to read a one-line file with 10,000,000 integers separated by ",":
What is the best approach for me to take to get the least amount of sorted files (or the biggest chunks of data per file possible)?
I believe you can apply the Two-Pass Multiway Mergesort (TPMMS) to solve the problem.
I will provide you with a general idea of what you can do, however, it would be better if you read more about the TPMMS:
// Every time you read a chunk you must be sure you are not leaving any number out (If the final bit is a number keep reading bit by bit until you reach a ",")
You will have to play around with the size of each buffer since you have a limited amount of memory.
The task is not an easy one. I'm sure this is not the best approach, but better than nothing:
list
like PriorityQueue
with size
and comparator
constructor args. Size should be known (according to your requirement)
add(..)
should match O(log n)
and sort items while inserting themadd(..)
should return false
when list fully filled (otherwise true
)[1,4,5],[3,8,9],[2,6,7]
--> [1,2,3], [4,5,6], [7,8,9]
For example, by picking minimums from list#1 and list#2, comparing them, so on. . .NOTE: Also you can perform Step#3 in concurrent
About #2:
I missed that you have string-data. So parsing byte sequence to integers is bad idea. However, it should be possible to parse data char-by-char and then convert to int when a comma appears. Also, buffer size can be calculated (max number length * bytes per symbol) -> for 2147483647,
in UTF-8 it's 11 * 1.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With