Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split huge file of integers (in one line) into sorted chunks with memory restriction

I recently needed to sort a one line file (integers separated by ",") into smaller chunks with memory restriction and efficiency in mind. I'm currently following this logic:

File file = new File("bigfile.txt");
FileInputStream fis = new FileInputStream(file);
BufferedInputStream bis = new BufferedInputStream(fis);
int BUFFER_SIZE = 10; // can and should be bigger
byte[] bytes = new byte[BUFFER_SIZE];
while ((bis.read(bytes)) != -1) {
   // convert bytes to string
   // split bytes to String[]
   // save the last number if was cut in the middle and save it for the next round of reading and remove it from the current String[]
   // fix cut number if necessary and put it in the String[]
   // sort the String[]
   // write the String[] into a file
   // call Garbage collector to prevent memory leak?
}
bis.close();

Assuming I'm restricted to 5MB of memory and have to read a one-line file with 10,000,000 integers separated by ",":

  • If I use a very small buffer size (ex. 10) to read the file then I'll create thousands of files.
  • If I use a decent but still small buffer size (ex. 100KB) then I will still get a lot of files.
  • If I use a bigger buffer size (ex. 4MB) then I will have heap problems when sorting and splitting the result in memory due to the restriction.

What is the best approach for me to take to get the least amount of sorted files (or the biggest chunks of data per file possible)?

like image 751
BestFaucetList XYZ Support Avatar asked Sep 26 '19 21:09

BestFaucetList XYZ Support


2 Answers

I believe you can apply the Two-Pass Multiway Mergesort (TPMMS) to solve the problem.

I will provide you with a general idea of what you can do, however, it would be better if you read more about the TPMMS:

// Every time you read a chunk you must be sure you are not leaving any number out (If the final bit is a number keep reading bit by bit until you reach a ",")

  • Given a limited amount of RAM and following the TPPMS, you will have to separate the file into chunks, order it, and then save each chunk into individual files.
  • For each of the files, create a PriorityQueue and read a certain amount of byte (such that you can read all the small files) and transform them back to number to store it in the queue. I'll call this list of PriorityQueues pqs for convenience.
  • Create another PriorityQueue (pq) with the size equal to the number of small files you have and push the first value of each queue of pqs.
  • Now comes the fun part; since you are using a PriorityQueue (pq) and you popped the first value of each PriorityQueue in pqs it is guaranteed that the first value in pq is the smallest value. (For every element popped in pq you can either write it directly to the final file or you can save it in an array and write it to the final file when the array is full, I prefer the last option.)
  • Every time you pop in pq you will have to read the next number from the file where that value was obtained, put it to the correct PriorityQueue in pqs and then pop the first value of that PriorityQueue.
  • You repeat the previous step until all the PriorityQueue in pqs are empty.

You will have to play around with the size of each buffer since you have a limited amount of memory.

like image 51
elco45 Avatar answered Sep 21 '22 12:09

elco45


The task is not an easy one. I'm sure this is not the best approach, but better than nothing:

  1. Find or create a list like PriorityQueue with size and comparator constructor args. Size should be known (according to your requirement)
    • Method add(..) should match O(log n) and sort items while inserting them
    • Method add(..) should return false when list fully filled (otherwise true)
  2. Perform the stream reading and immediately (without buffering) add integers to your collection. Create a new list if no free space.
  3. Now you have a collection of sorted lists. The last step is to sort data through lists: [1,4,5],[3,8,9],[2,6,7] --> [1,2,3], [4,5,6], [7,8,9] For example, by picking minimums from list#1 and list#2, comparing them, so on. . .

NOTE: Also you can perform Step#3 in concurrent

About #2: I missed that you have string-data. So parsing byte sequence to integers is bad idea. However, it should be possible to parse data char-by-char and then convert to int when a comma appears. Also, buffer size can be calculated (max number length * bytes per symbol) -> for 2147483647, in UTF-8 it's 11 * 1.

like image 33
Rostyslav Barmakov Avatar answered Sep 22 '22 12:09

Rostyslav Barmakov