
What is the best approach to sort 2 large text files in Java? [closed]

I am building a simple Java application that reads information from a CSV file. The information in the CSV file has this form:

"ID","Description"
"AB","Some sort of information for AB"
"AC","Some sort of information for AC"

I am required to let the user print the description, the ID, or both to the console, sorted by ID. The simplest solution would be to parse the file with a CSV library such as opencsv, put the strings in a TreeMap, and print the contents of the TreeMap. The key in the TreeMap would be the ID, and the value would be the description.
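
For reference, the TreeMap approach described above can be sketched roughly as follows. This is only an illustration: the class name `CsvTreeMapSketch` and the hand-rolled `parseLine` are hypothetical, and the parser assumes every line looks exactly like `"ID","Description"` with no embedded quotes or line breaks. A real CSV library such as opencsv would take over that part. It also assumes IDs are unique, since a TreeMap keeps only one value per key.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical class name; a sketch of the in-memory TreeMap approach.
class CsvTreeMapSketch {

    // Naive parser for lines shaped exactly like "ID","Description".
    // Assumption: no embedded quotes or line breaks inside a field.
    static String[] parseLine(String line) {
        int sep = line.indexOf("\",\"");
        if (sep < 0 || !line.startsWith("\"") || !line.endsWith("\"")) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String id = line.substring(1, sep);
        String description = line.substring(sep + 3, line.length() - 1);
        return new String[] {id, description};
    }

    // Loads every record into a TreeMap, which keeps its keys (the IDs) sorted.
    static TreeMap<String, String> load(BufferedReader in) {
        TreeMap<String, String> byId = new TreeMap<>();
        in.lines()
          .skip(1)                          // skip the header line
          .filter(line -> !line.isEmpty())  // ignore blank lines
          .forEach(line -> {
              String[] fields = parseLine(line);
              byId.put(fields[0], fields[1]);
          });
        return byId;
    }

    public static void main(String[] args) throws IOException {
        try (BufferedReader in =
                 Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
            // Iterating a TreeMap visits the entries in ascending key order.
            for (Map.Entry<String, String> e : load(in).entrySet()) {
                System.out.println(e.getKey() + " " + e.getValue());
            }
        }
    }
}
```

Printing only the ID or only the description is then just a matter of which part of each entry gets written out.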

However, the CSV file could be huge. It could be 5 GB, and loading 5 GB of strings into a TreeMap would cause an out-of-memory error. To handle large files, I could sort the file using an external merge sort. Once I have the sorted file, I could print its contents to the console by simply reading it.
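
To make the external merge sort idea concrete, here is a minimal sketch, not a tuned implementation. It reads a bounded number of lines at a time, sorts each batch in memory, spills it to a temporary file (a "run"), and then merges all runs with a priority queue. The names `ExternalSortSketch`, `sort`, and `maxLinesPerRun` are made up for the example. It compares whole lines, which orders records by ID for this format as long as the IDs contain no characters that sort below the double quote, and it assumes the caller has already consumed the header line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical class name; a sketch of an external merge sort over text lines.
class ExternalSortSketch {

    // One open run file plus its current line; runs are ordered by that line.
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        String line;

        Run(BufferedReader reader) throws IOException {
            this.reader = reader;
            this.line = reader.readLine();
        }

        boolean advance() throws IOException {
            line = reader.readLine();
            return line != null;
        }

        @Override
        public int compareTo(Run other) {
            return line.compareTo(other.line);
        }
    }

    // Sorts the lines from 'in' into 'out', holding at most maxLinesPerRun lines in memory.
    static void sort(BufferedReader in, Writer out, int maxLinesPerRun) throws IOException {
        List<Path> runs = new ArrayList<>();
        List<String> buffer = new ArrayList<>();
        String line;
        // Phase 1: cut the input into sorted runs on disk.
        while ((line = in.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() >= maxLinesPerRun) {
                runs.add(spill(buffer));
            }
        }
        if (!buffer.isEmpty()) {
            runs.add(spill(buffer));
        }

        // Phase 2: k-way merge; the heap always yields the run with the smallest current line.
        PriorityQueue<Run> heap = new PriorityQueue<>();
        try {
            for (Path path : runs) {
                Run run = new Run(Files.newBufferedReader(path));
                if (run.line != null) heap.add(run); else run.reader.close();
            }
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();
                out.write(smallest.line);
                out.write('\n');
                if (smallest.advance()) heap.add(smallest); else smallest.reader.close();
            }
        } finally {
            for (Run run : heap) run.reader.close();
            for (Path path : runs) Files.deleteIfExists(path);
        }
        out.flush();
    }

    // Sorts the buffered lines, writes them to a temp file, and empties the buffer.
    private static Path spill(List<String> buffer) throws IOException {
        Collections.sort(buffer);
        Path run = Files.createTempFile("run", ".txt");
        Files.write(run, buffer);
        buffer.clear();
        return run;
    }
}
```

In practice `maxLinesPerRun` would be picked from the available heap rather than hard-coded, and `out` could wrap either a sorted output file or the console directly.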

An external merge sort will almost certainly be much slower than loading the contents of the file into a TreeMap. I am considering checking the file size first: if the file is larger than the available memory, I will use an external merge sort; otherwise, I will load the contents of the file into the TreeMap.
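
The size check could look something like the sketch below. Everything here is an assumption for illustration: the class and method names are hypothetical, and `ASSUMED_EXPANSION` is a guessed safety factor, because strings and TreeMap entries on the heap take noticeably more room than the same bytes on disk. The heap estimate from `Runtime` is also only approximate.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper that picks a sorting strategy from the file size.
class SortStrategyChooser {

    enum Strategy { IN_MEMORY, EXTERNAL_MERGE }

    // Guessed ratio of heap footprint to on-disk size; tune it by measuring.
    static final long ASSUMED_EXPANSION = 4;

    // Chooses in-memory sorting only if the expanded data is expected to fit.
    static Strategy choose(long fileBytes, long availableHeapBytes) {
        // Divide instead of multiplying so very large sizes cannot overflow.
        return fileBytes <= availableHeapBytes / ASSUMED_EXPANSION
                ? Strategy.IN_MEMORY
                : Strategy.EXTERNAL_MERGE;
    }

    // Rough estimate of the heap the JVM could still hand out.
    static long availableHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    public static void main(String[] args) throws IOException {
        long size = Files.size(Paths.get(args[0]));
        System.out.println(choose(size, availableHeapBytes()));
    }
}
```

Keeping the decision in one small function like this at least confines the "two code paths" concern to a single place.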

However, this would mean two separate blocks of code performing two different sorts, which increases the amount of code that needs to be maintained. If you were writing this application, would you write two separate code paths, one for small CSV files and one for big ones? Or would you just sort the file with an external merge sort regardless of the file size?

Or is there an alternative to this approach?

Thank you.

zfranciscus asked Oct 05 '22 07:10



1 Answer

Parse the CSV file yourself, adding only the ID column to the TreeMap as the key; as the value, record the byte offset of that line (the total byte length of everything before it). Afterward, for printing, use a RandomAccessFile to seek to each offset and read the corresponding line. If this approach still overflows your memory, take a look at MapDB. It provides TreeMap implementations that seamlessly overflow to disk and has great performance.
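
A rough sketch of the offset-index idea, under a few assumptions: the class and method names are hypothetical, the file is UTF-8 with lines shaped exactly like `"ID","Description"`, the first line is a header, and IDs are unique. The first pass counts bytes while reading so each ID maps to the offset where its line starts; the second pass seeks to each offset in ID order.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.TreeMap;

// Hypothetical class name; keeps only IDs and byte offsets in memory.
class OffsetIndexSketch {

    // Extracts the ID from a line shaped like "ID","Description" (assumed format).
    static String idOf(String line) {
        return line.substring(1, line.indexOf("\",\""));
    }

    // Pass 1: map each ID to the byte offset at which its line starts.
    static TreeMap<String, Long> buildIndex(Path csv) throws IOException {
        TreeMap<String, Long> index = new TreeMap<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(csv))) {
            ByteArrayOutputStream line = new ByteArrayOutputStream();
            long offset = 0;     // bytes consumed so far
            long lineStart = 0;  // offset of the current line's first byte
            boolean header = true;
            int b;
            while ((b = in.read()) != -1) {
                offset++;
                if (b != '\n') { line.write(b); continue; }
                if (!header) record(index, line, lineStart);
                header = false;
                line.reset();
                lineStart = offset;
            }
            if (!header) record(index, line, lineStart); // last line without a newline
        }
        return index;
    }

    // Decodes one line and stores its ID with the line's start offset; blank lines are skipped.
    private static void record(TreeMap<String, Long> index, ByteArrayOutputStream bytes,
                               long lineStart) {
        String line = new String(bytes.toByteArray(), StandardCharsets.UTF_8).trim();
        if (!line.isEmpty()) index.put(idOf(line), lineStart);
    }

    // Pass 2: visit offsets in ID order and copy each line to 'out'.
    static void printSorted(Path csv, TreeMap<String, Long> index, Appendable out)
            throws IOException {
        try (RandomAccessFile file = new RandomAccessFile(csv.toFile(), "r")) {
            for (long offset : index.values()) {
                file.seek(offset);
                // readLine() maps each byte to one char, so re-decode the bytes as UTF-8.
                String raw = file.readLine();
                out.append(new String(raw.getBytes(StandardCharsets.ISO_8859_1),
                                      StandardCharsets.UTF_8)).append('\n');
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path csv = Paths.get(args[0]);
        printSorted(csv, buildIndex(csv), System.out);
    }
}
```

The index still holds every ID in memory, and the second pass does one seek per record, so this trades memory for random disk reads. Whether that beats an external merge sort depends on the data and the storage.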

kpentchev answered Oct 10 '22 02:10
