Performance optimization when searching for files in a file system

I have a network-attached storage location holding around 5 million .txt files related to around 3 million transactions; the total data size is around 3.5 TB. For each transaction I have to search that location to find whether the related file is available or not, and then produce two separate CSV reports: "available files" and "not available files". We are still on Java 6. The challenge I am facing is that, since I have to search the location recursively, one search takes around 2 minutes on average because of the huge size. I am using the Java I/O API to search recursively, like below. Is there any way I can improve the performance?

// Recursively walk 'location' looking for a file named 'fileName'.
File searchFile(File location, String fileName) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {  // listFiles() returns null on I/O error or missing permissions
            for (File f : arr) {
                File found = searchFile(f, fileName);
                if (found != null)
                    return found;
            }
        }
    } else if (location.getName().equals(fileName)) {
        return location;
    }
    return null;
}
asked Nov 18 '18 by Samarjit Baruah


2 Answers

You should take a different approach: rather than walking the entire directory tree every time you search for a file, build an index once, i.e. a mapping from file name to file location.

Essentially:

// Recursively populate 'index' with fileName -> File for every file under 'location'.
void buildIndex(Map<String, File> index, File location) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {  // guard against I/O errors, as in the search method above
            for (File f : arr) {
                buildIndex(index, f);
            }
        }
    } else {
        index.put(location.getName(), location);
    }
}

Now that you've got the index, searching for a file becomes a trivial map lookup.
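For example, checking one transaction's file (the file name here is a hypothetical placeholder):

File f = index.get("TX12345.txt");  // O(1) map lookup instead of a 2-minute directory walk
boolean available = (f != null);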

Since the file names are now keys in a Map, you can also use set operations to find the intersection with your transaction list:

Map<String, File> index = new HashMap<String, File>();  // no diamond operator on Java 6
buildIndex(index, ...);
Set<String> fileSet = index.keySet();
Set<String> transactionSet = ...;
Set<String> intersection = new HashSet<String>(fileSet);
intersection.retainAll(transactionSet);  // retainAll on the copy, not on the live keySet
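
From there, producing the two reports is straightforward. A minimal Java 6 sketch (the output file names and the assumption that transactionSet holds expected file names are mine, not part of the original answer; needs java.io.* and IOException handling):

PrintWriter availableOut = new PrintWriter(new FileWriter("available.csv"));
PrintWriter missingOut = new PrintWriter(new FileWriter("not_available.csv"));
for (String name : transactionSet) {
    File f = index.get(name);
    if (f != null) {
        availableOut.println(name + "," + f.getAbsolutePath());  // file found on the NAS
    } else {
        missingOut.println(name);  // no file for this transaction
    }
}
availableOut.close();
missingOut.close();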

Optionally, if the index itself is too big to keep in memory, you can build it in an SQLite database instead.
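A minimal sketch of that variant, assuming the sqlite-jdbc driver is on the classpath (the database and table names are arbitrary; needs java.sql.* and exception handling):

Class.forName("org.sqlite.JDBC");
Connection conn = DriverManager.getConnection("jdbc:sqlite:file_index.db");
conn.createStatement().execute(
    "CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, path TEXT)");
conn.setAutoCommit(false);  // group all inserts into one transaction for speed
PreparedStatement ins = conn.prepareStatement("INSERT OR REPLACE INTO files VALUES (?, ?)");
// inside the directory walk, for every regular file f:
ins.setString(1, f.getName());
ins.setString(2, f.getAbsolutePath());
ins.addBatch();
// after the walk completes:
ins.executeBatch();
conn.commit();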

answered by Lie Ryan


  • Searching a directory tree, especially on network-attached storage, is a nightmare; it takes a lot of time when the tree is too big or too deep. Since you are on Java 6, you can follow an old-fashioned approach: first list all the files into a CSV file, e.g.:

    find . -type f -name '*.txt' > test.csv    (if Unix)

    dir /b/s *.txt > test.csv (if Windows)

  • Now load this CSV file into a Map keyed by file name, so that it acts as an index. Loading the file will take some time since it is huge, but once it is loaded, looking a file name up in the map is very quick and will cut your search time drastically; see the sketch after this list.
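
A minimal Java 6 sketch of that loading step (the file name test.csv comes from the commands above, the looked-up name TX12345.txt is a hypothetical placeholder, and IOException handling is omitted; needs java.io.* and java.util.*):

Map<String, String> index = new HashMap<String, String>();
BufferedReader reader = new BufferedReader(new FileReader("test.csv"));
String path;
while ((path = reader.readLine()) != null) {  // one absolute path per line
    String name = path.substring(path.lastIndexOf(File.separatorChar) + 1);
    index.put(name, path);  // file name -> full path
}
reader.close();

boolean exists = index.containsKey("TX12345.txt");  // O(1) membership test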
answered by utpal416