Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast sort by date of huge file ArrayList

I have an ArrayList in Java with a huge amount of files (~40.000 files). I need to sort these files ascending/descending by their date. Currently, I use a simple

Collections.sort(fileList, new FileDateComparator());

where FileDateComparator is

public class FileDateComparator implements Comparator<File>
{
       @Override
       public int compare(File o1, File o2)
       {
           if(o1.lastModified() < o2.lastModified())
               return -1;
           if(o1.lastModified()==o2.lastModified())
               return 0;
          return 1;
       }
    }

Sorting takes up up a too long time for me, like 20 seconds or more. Is there a more efficient way to realize this? I already tried Apache I/O LastModifiedFileComparator as comparator, but it seems to be implemented the same way, since it takes the same time.

like image 924
user3581199 Avatar asked Apr 28 '14 11:04

user3581199


People also ask

How do you sort huge files?

For sorting a very large file , we can use external sorting technique. External sorting is an algorithm that can handle massive amounts of data. It is required when the data to be sorted does not fit into the main memory and instead they reside in the slower external memory . It uses a hybrid sort-merge strategy.

How do I sort large files with small memory?

Suppose we have to sort a 1GB file of random integers and the available ram size is 200 Mb, how will it be done? The easiest way to do this is to use external sorting. We divide our source file into temporary files of size equal to the size of the RAM and first sort these files.

How do you sort a list by date?

Inside the compare method for return value use the compareTo() method which will return the specified value by comparing the DateItem objects. Now in the main method use Collections. sort() method and pass the ArrayList and 'SortItem' class object to it, it will sort the dates, and output will be generated.

Does LinkedHashSet sort?

HashSet does not maintain any order while LinkedHashSet maintains insertion order of elements much like List interface and TreeSet maintains sorting order or elements.


2 Answers

I think you need to cache the modification times to speed this up. You could e.g. try something like this:

class DatedFile {
  File f;
  long moddate;

  public DatedFile(File f, long moddate) {
    this.f = f;
    this.moddate = moddate;
  }
};


ArrayList<DatedFile> datedFiles = new ArrayList<DatedFile>();
for (File f: fileList) {
  datedFiles.add(new DatedFile(f, f.lastModified()));
}
Collections.sort(fileList, new FileDateComparator());
ArrayList<File> sortedFiles = new ArrayList<File>();
for (DatedFile f: datedFiles) {
  sortedFiles.add(f.f);
}

(with an appropriate FileDateComparator implementation)

like image 150
thejh Avatar answered Oct 19 '22 03:10

thejh


Sorting is O(n lg N), so your list of 40,000 files will need about 600,000 operations (comparisons). If it takes about 20 seconds, that is about 30,000 comparisons per second. So each comparison is taking about 100,000 clock cycles. That can not be due to CPU-bound processing. The sorting is almost certainly I/O bound rather than CPU bound. Disk seeks are particularly expensive.

You might be able to reduce the time by using multi-threading to reduce the impact of disk seeks. That is, by having several reads queued and waiting for the disk drive to provide their data. To do that, use a (concurrent) map that maps file names to modification times, and populate that map using multiple threads. Then have your sort method use that map rather than use File.lastModified() itself.

Even if you populated that map with only one thread, you would gain a little benefit because your sort method would be using locally cached modification times, rather than querying the O/S every time for the modification times. The benefit of that caching might not be large, because the O/S itself is likely to cache that information.

like image 28
Raedwald Avatar answered Oct 19 '22 03:10

Raedwald