I have an ArrayList in Java containing a huge number of files (~40,000). I need to sort these files ascending/descending by their modification date. Currently I use a simple
Collections.sort(fileList, new FileDateComparator());
where FileDateComparator is
public class FileDateComparator implements Comparator<File>
{
    @Override
    public int compare(File o1, File o2)
    {
        if (o1.lastModified() < o2.lastModified())
            return -1;
        if (o1.lastModified() == o2.lastModified())
            return 0;
        return 1;
    }
}
Sorting takes far too long for me, 20 seconds or more. Is there a more efficient way to do this? I already tried Apache Commons IO's LastModifiedFileComparator, but it seems to be implemented the same way, since it takes just as long.
For sorting a very large file, we can use the external sorting technique. External sorting is an algorithm that can handle massive amounts of data. It is required when the data to be sorted does not fit into main memory and must instead reside in slower external storage. It uses a hybrid sort-merge strategy.
Suppose we have to sort a 1 GB file of random integers and the available RAM is 200 MB. The easiest way to do this is external sorting: divide the source file into temporary chunks no larger than the available RAM, sort each chunk in memory and write it out, then merge the sorted chunks.
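A minimal sketch of that idea (the ExternalSort class name, the chunk size, and the use of line-based records are all illustrative assumptions, not part of the original answer):
import java.io.*;
import java.nio.file.*;
import java.util.*;

public class ExternalSort
{
    // Assumed chunk size; pick it so one sorted chunk fits comfortably in RAM.
    private static final int CHUNK_LINES = 1_000_000;

    public static void sort(Path input, Path output) throws IOException
    {
        // Phase 1: split the input into sorted temporary chunk files.
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input)) {
            List<String> buffer = new ArrayList<>();
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                if (buffer.size() == CHUNK_LINES) {
                    chunks.add(writeSortedChunk(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty())
                chunks.add(writeSortedChunk(buffer));
        }

        // Phase 2: k-way merge of the sorted chunks via a priority queue.
        // NOTE: records are compared as plain strings here; to sort integers
        // numerically, parse them before comparing.
        PriorityQueue<Map.Entry<String, BufferedReader>> heap =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        for (Path chunk : chunks) {
            BufferedReader r = Files.newBufferedReader(chunk);
            String first = r.readLine();
            if (first != null)
                heap.add(new AbstractMap.SimpleEntry<>(first, r));
        }
        try (BufferedWriter out = Files.newBufferedWriter(output)) {
            while (!heap.isEmpty()) {
                Map.Entry<String, BufferedReader> head = heap.poll();
                out.write(head.getKey());
                out.newLine();
                String next = head.getValue().readLine();
                if (next != null)
                    heap.add(new AbstractMap.SimpleEntry<>(next, head.getValue()));
                else
                    head.getValue().close();
            }
        }
    }

    private static Path writeSortedChunk(List<String> buffer) throws IOException
    {
        Collections.sort(buffer); // in-memory sort of one chunk
        Path chunk = Files.createTempFile("sort-chunk", ".txt");
        Files.write(chunk, buffer);
        return chunk;
    }
}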
Inside the compare method, use the compareTo() method, which will return the appropriate value by comparing the DateItem objects. Then, in the main method, use Collections.sort() and pass it the ArrayList along with a 'SortItem' comparator object; it will sort the dates and produce the output.
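A minimal sketch of that pattern (the DateItem and SortItem names follow the description above; the field layout is an assumption):
import java.util.*;

class DateItem
{
    Date date;
    DateItem(Date date) { this.date = date; }
}

class SortItem implements Comparator<DateItem>
{
    @Override
    public int compare(DateItem a, DateItem b)
    {
        return a.date.compareTo(b.date); // delegate to Date.compareTo()
    }
}

// Usage: Collections.sort(dateList, new SortItem());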
HashSet does not maintain any order, while LinkedHashSet maintains the insertion order of elements, much like the List interface, and TreeSet maintains elements in sorted order.
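If you want the files kept sorted as you add them, a TreeSet with a comparator is one option. Note, though, that a TreeSet treats elements that compare as equal as duplicates, so files sharing a timestamp need a tie-breaker; the name-based tie-breaker in this sketch is an assumption:
import java.io.File;
import java.util.*;

// Sorted by modification time, falling back to the file name so that
// two files with the same timestamp are not silently dropped.
Set<File> sortedFiles = new TreeSet<>(
        Comparator.comparingLong(File::lastModified)
                  .thenComparing(File::getName));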
I think you need to cache the modification times to speed this up. You could, for example, try something like this:
class DatedFile
{
    File f;
    long moddate;

    public DatedFile(File f, long moddate)
    {
        this.f = f;
        this.moddate = moddate;
    }
}

ArrayList<DatedFile> datedFiles = new ArrayList<DatedFile>();
for (File f : fileList) {
    datedFiles.add(new DatedFile(f, f.lastModified())); // read each timestamp exactly once
}
Collections.sort(datedFiles, new DatedFileComparator());
ArrayList<File> sortedFiles = new ArrayList<File>();
for (DatedFile f : datedFiles) {
    sortedFiles.add(f.f);
}
(with an appropriate DatedFileComparator implementation)
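A minimal sketch of that comparator, assuming the DatedFile class above; it compares only the cached timestamps, so no disk access happens during the sort:
class DatedFileComparator implements Comparator<DatedFile>
{
    @Override
    public int compare(DatedFile a, DatedFile b)
    {
        return Long.compare(a.moddate, b.moddate); // ascending; swap a and b for descending
    }
}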
Sorting is O(n log n), so your list of 40,000 files will need about 600,000 comparisons. If that takes about 20 seconds, you are doing about 30,000 comparisons per second, so each comparison takes roughly 100,000 clock cycles. That cannot be due to CPU-bound processing: the sorting is almost certainly I/O bound rather than CPU bound, and disk seeks are particularly expensive.
You might be able to reduce the time by using multi-threading to reduce the impact of disk seeks, that is, by having several reads queued and waiting for the disk drive to provide their data. To do that, use a concurrent map that maps files to modification times, and populate that map using multiple threads. Then have your sort use that map rather than calling File.lastModified() itself.
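A sketch of that idea, assuming Java 8+; here a parallel stream populates a ConcurrentHashMap so several lastModified() calls can be outstanding at once (for heavily I/O-bound work, a dedicated ExecutorService with more threads than cores may queue reads more aggressively than the common fork-join pool):
import java.io.File;
import java.util.*;
import java.util.concurrent.ConcurrentHashMap;

// Pre-fetch all modification times concurrently...
Map<File, Long> modTimes = new ConcurrentHashMap<>();
fileList.parallelStream()
        .forEach(f -> modTimes.put(f, f.lastModified()));

// ...then sort purely against the cached values.
fileList.sort(Comparator.comparingLong(modTimes::get));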
Even if you populated that map with only one thread, you would gain a little, because your sort would use locally cached modification times rather than querying the OS every time. The benefit of that caching might not be large, though, because the OS itself is likely to cache that information.