I have a network attached storage holding around 5 million .txt files related to around 3 million transactions; the total data is around 3.5 TB. I have to search that location to find out whether the file for a given transaction is available or not, and then produce two separate CSV reports: "available files" and "not available files". We are still on Java 6. The challenge is that, because I have to search the location recursively, a single search takes around 2 minutes on average due to the huge size. I am using the Java I/O API to search recursively, like below. Is there any way I can improve the performance?
File searchFile(File location, String fileName) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {                        // listFiles() can return null on I/O error
            for (File f : arr) {
                File found = searchFile(f, fileName);
                if (found != null)
                    return found;
            }
        }
    } else {
        if (location.getName().equals(fileName)) {
            return location;
        }
    }
    return null;
}
You should take a different approach: rather than walking the entire directory tree every time you search for a file, build an index once, that is, a mapping from file name to file location.
Essentially:
void buildIndex(Map<String, File> index, File location) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {
            for (File f : arr) {
                buildIndex(index, f);
            }
        }
    } else {
        index.put(location.getName(), location);
    }
}
Now that you've got the index, searching for the files becomes trivial.
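As a minimal sketch (assuming the index has been built over the root of the share, and using a made-up transaction file name), a lookup is now a single map access instead of a recursive scan:

// O(1) lookup against the in-memory index instead of a multi-minute directory walk
File match = index.get("TXN-12345.txt");      // hypothetical transaction file name
if (match != null) {
    // file exists; match.getAbsolutePath() gives its location on the share
} else {
    // no file for this transaction
}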
Since the file names are now keys in a Map, you can also use set operations to find the intersection:
Map<String, File> index = new HashMap<String, File>();
buildIndex(index, ...);
Set<String> fileSet = index.keySet();
Set<String> transactionSet = ...;
Set<String> intersection = new HashSet<String>(fileSet);
intersection.retainAll(transactionSet);
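From there, producing the two CSV reports the question asks for is a single pass over the index. Here is a rough Java 6 sketch (no try-with-resources); it assumes transactionSet holds the expected file name for each transaction, and the output file names are made up:

import java.io.*;
import java.util.*;

// Sketch: write "available" and "not available" reports from the index.
// Assumes transactionSet contains the expected file name per transaction.
void writeReports(Map<String, File> index, Set<String> transactionSet) throws IOException {
    PrintWriter available = new PrintWriter(new FileWriter("available.csv"));
    PrintWriter missing = new PrintWriter(new FileWriter("not_available.csv"));
    try {
        for (String name : transactionSet) {
            File f = index.get(name);
            if (f != null) {
                available.println(name + "," + f.getAbsolutePath());   // found on the share
            } else {
                missing.println(name);                                  // no matching file
            }
        }
    } finally {
        available.close();
        missing.close();
    }
}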
Optionally, if the index itself is too big to keep in memory, you may want to create the index in an SQLite database.
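If you go that route, a rough, untested sketch with plain JDBC could look like the following; it assumes the sqlite-jdbc driver is on the classpath, and the database, table, and column names are purely illustrative:

import java.io.File;
import java.sql.*;

// Rough sketch: keep the filename -> path index in SQLite instead of a HashMap.
Connection openIndexDb() throws Exception {
    Class.forName("org.sqlite.JDBC");              // load the driver explicitly (Java 6 style)
    Connection conn = DriverManager.getConnection("jdbc:sqlite:file_index.db");
    Statement st = conn.createStatement();
    st.executeUpdate("CREATE TABLE IF NOT EXISTS file_index (name TEXT PRIMARY KEY, path TEXT)");
    st.close();
    return conn;
}

// Called for each file found while walking the share.
void addToIndex(Connection conn, File f) throws SQLException {
    PreparedStatement ps = conn.prepareStatement(
            "INSERT OR REPLACE INTO file_index (name, path) VALUES (?, ?)");
    ps.setString(1, f.getName());
    ps.setString(2, f.getAbsolutePath());
    ps.executeUpdate();
    ps.close();
}

// Returns the stored path, or null if the file is "not available".
String lookup(Connection conn, String fileName) throws SQLException {
    PreparedStatement ps = conn.prepareStatement("SELECT path FROM file_index WHERE name = ?");
    ps.setString(1, fileName);
    ResultSet rs = ps.executeQuery();
    String path = rs.next() ? rs.getString(1) : null;
    rs.close();
    ps.close();
    return path;
}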
Another option is to have the operating system produce the file listing once and work from that:

find . -type f -name '*.txt' > test.csv    (on Unix)
dir /b/s *.txt > test.csv                  (on Windows)
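If you generate such a listing, a small Java 6 sketch can load it into a map so that later availability checks never touch the NAS again (the listing file name matches the commands above; the method name is an assumption):

import java.io.*;
import java.util.*;

// Sketch: load the one-off listing (e.g. test.csv from the command above)
// into a name -> full-path map for fast lookups.
Map<String, String> loadListing(File listing) throws IOException {
    Map<String, String> byName = new HashMap<String, String>();
    BufferedReader in = new BufferedReader(new FileReader(listing));
    try {
        String path;
        while ((path = in.readLine()) != null) {
            byName.put(new File(path).getName(), path);    // key on the bare file name
        }
    } finally {
        in.close();
    }
    return byName;
}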