I have a network attached storage holding around 5 million .txt files related to around 3 million transactions; the total data is around 3.5 TB. I have to search that location to find out whether the file for a given transaction is available or not, and then produce two separate CSV reports: "available files" and "not available files". We are still on Java 6. The challenge is that, because I have to search the location recursively, a single search takes around 2 minutes on average due to the huge size. I am using the Java I/O API to search recursively, like below. Is there any way I can improve the performance?
File searchFile(File location, String fileName) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {                        // listFiles() can return null on I/O error
            for (File f : arr) {
                File found = searchFile(f, fileName);
                if (found != null)
                    return found;
            }
        }
    } else {
        if (location.getName().equals(fileName)) {
            return location;
        }
    }
    return null;
}
You should take a different approach: rather than walking the entire directory tree every time you search for a file, build an index once, that is, a mapping from file name to file location.
Essentially:
void buildIndex(Map<String, File> index, File location) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {
            for (File f : arr) {
                buildIndex(index, f);
            }
        }
    } else {
        index.put(location.getName(), location);
    }
}
Now that you've got the index, searching for the files becomes trivial.
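As a minimal sketch (assuming the index has been built over the root of the share, and using a made-up transaction file name), a lookup is now a single map access instead of a recursive scan:

// O(1) lookup against the in-memory index instead of a multi-minute directory walk
File match = index.get("TXN-12345.txt");      // hypothetical transaction file name
if (match != null) {
    // file exists; match.getAbsolutePath() gives its location on the share
} else {
    // no file for this transaction
}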
Since the file names are now keys in a Map, you can also use set operations to find the intersection:
Map<String, File> index = new HashMap<String, File>();
buildIndex(index, ...);
Set<String> fileSet = index.keySet();
Set<String> transactionSet = ...;
Set<String> intersection = new HashSet<String>(fileSet);
intersection.retainAll(transactionSet);
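From there, producing the two CSV reports the question asks for is a single pass over the index. Here is a rough Java 6 sketch (no try-with-resources); it assumes transactionSet holds the expected file name for each transaction, and the output file names are made up:

import java.io.*;
import java.util.*;

// Sketch: write "available" and "not available" reports from the index.
// Assumes transactionSet contains the expected file name per transaction.
void writeReports(Map<String, File> index, Set<String> transactionSet) throws IOException {
    PrintWriter available = new PrintWriter(new FileWriter("available.csv"));
    PrintWriter missing = new PrintWriter(new FileWriter("not_available.csv"));
    try {
        for (String name : transactionSet) {
            File f = index.get(name);
            if (f != null) {
                available.println(name + "," + f.getAbsolutePath());   // found on the share
            } else {
                missing.println(name);                                  // no matching file
            }
        }
    } finally {
        available.close();
        missing.close();
    }
}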
Optionally, if the index itself is too big to keep in memory, you may want to create the index in an SQLite database.
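If you go that route, a rough, untested sketch with plain JDBC could look like the following; it assumes the sqlite-jdbc driver is on the classpath, and the database, table, and column names are purely illustrative:

import java.io.File;
import java.sql.*;

// Rough sketch: keep the filename -> path index in SQLite instead of a HashMap.
Connection openIndexDb() throws Exception {
    Class.forName("org.sqlite.JDBC");              // load the driver explicitly (Java 6 style)
    Connection conn = DriverManager.getConnection("jdbc:sqlite:file_index.db");
    Statement st = conn.createStatement();
    st.executeUpdate("CREATE TABLE IF NOT EXISTS file_index (name TEXT PRIMARY KEY, path TEXT)");
    st.close();
    return conn;
}

// Called for each file found while walking the share.
void addToIndex(Connection conn, File f) throws SQLException {
    PreparedStatement ps = conn.prepareStatement(
            "INSERT OR REPLACE INTO file_index (name, path) VALUES (?, ?)");
    ps.setString(1, f.getName());
    ps.setString(2, f.getAbsolutePath());
    ps.executeUpdate();
    ps.close();
}

// Returns the stored path, or null if the file is "not available".
String lookup(Connection conn, String fileName) throws SQLException {
    PreparedStatement ps = conn.prepareStatement("SELECT path FROM file_index WHERE name = ?");
    ps.setString(1, fileName);
    ResultSet rs = ps.executeQuery();
    String path = rs.next() ? rs.getString(1) : null;
    rs.close();
    ps.close();
    return path;
}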
Another option is to have the operating system produce the file listing once and work from that:

find . -type f -name '*.txt' > test.csv    (on Unix)
dir /b/s *.txt > test.csv                  (on Windows)
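If you generate such a listing, a small Java 6 sketch can load it into a map so that later availability checks never touch the NAS again (the listing file name matches the commands above; the method name is an assumption):

import java.io.*;
import java.util.*;

// Sketch: load the one-off listing (e.g. test.csv from the command above)
// into a name -> full-path map for fast lookups.
Map<String, String> loadListing(File listing) throws IOException {
    Map<String, String> byName = new HashMap<String, String>();
    BufferedReader in = new BufferedReader(new FileReader(listing));
    try {
        String path;
        while ((path = in.readLine()) != null) {
            byName.put(new File(path).getName(), path);    // key on the bare file name
        }
    } finally {
        in.close();
    }
    return byName;
}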