
Fastest way to recursively crawl NTFS directories in C++

I have written a small crawler to scan and re-sort directory structures.

It is based on dirent (which is a small wrapper around FindNextFileA). In my first benchmarks it is surprisingly slow:

around 123473 ms for 4500 files (ThinkPad T60p, local Samsung 320 GB 2.5" HD): 121481 files found in 123473 milliseconds. Is this speed normal?

This is my code:

#include <dirent.h>   // POSIX-style wrapper around the Win32 FindFirstFile/FindNextFile API
#include <cstring>
#include <cstdio>
#include <string>

int testPrintDir(std::string strDir, std::string strPattern="*", bool recurse=true){
  struct dirent *ent;
  DIR *dir = opendir (strDir.c_str());
  int retVal = 0;
  if (dir != NULL) {
    while ((ent = readdir (dir)) != NULL) {
      // skip the "." and ".." pseudo-entries
      if (strcmp(ent->d_name, ".") != 0 && strcmp(ent->d_name, "..") != 0){
        std::string strFullName = strDir + "\\" + std::string(ent->d_name);
        // the Windows dirent port exposes the underlying WIN32_FIND_DATAA record as "data"
        bool isDir = (ent->data.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0;
        std::string strType = isDir ? "DIR" : "FILE";
        if (!isDir){
          //printf ("%s <%s>\n", strFullName.c_str(), strType.c_str());
          retVal++;                      // count plain files
        }
        if (isDir && recurse){
          retVal += testPrintDir(strFullName, strPattern, recurse);
        }
      }
    }
    closedir (dir);
    return retVal;
  } else {
    /* could not open directory */
    perror ("DIR NOT FOUND!");
    return -1;
  }
}
asked Jun 14 '10 by Peter Parker

3 Answers

There are some circumstances where such a speed is normal, yes. First, using FindFirstFileA instead of FindFirstFileW incurs overhead for the conversion from UTF-16 to ANSI. Second, if you are going through directories that have not yet been accessed by the operating system, you will incur at least one seek penalty (about 16 ms on most consumer hard drives), limiting your enumeration to well under 100 directory checks per second. This gets worse if the Master File Table on the given drive is badly fragmented.

As for the number of files, it depends more on the number of files per directory than on the total number of files.
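
To make the first point concrete, here is a minimal sketch of the same recursive count written against the wide-char API directly, so no per-entry UTF-16-to-ANSI conversion takes place. This is my illustration, not part of the original answer, and countFilesW is a made-up name:

#include <windows.h>
#include <cwchar>
#include <string>

// Recursively count plain files below dir using FindFirstFileW/FindNextFileW,
// avoiding the ANSI conversion that the *A variants perform on every entry.
int countFilesW(const std::wstring &dir) {
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (h == INVALID_HANDLE_VALUE) return 0;
    int count = 0;
    do {
        if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
            continue;                    // skip the "." and ".." pseudo-entries
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            count += countFilesW(dir + L"\\" + fd.cFileName);
        else
            ++count;
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return count;
}

Note this only removes the conversion overhead; the seek penalty for cold directories is unchanged.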

answered by Billy ONeal


The very first time you ever do a recursive directory crawl, you should probably enumerate the entire current directory first and queue up any directories you find to visit when you are done. That way, you are likely to take advantage of any immediate read-ahead optimizations NTFS might do.

On subsequent recursive directory crawls, if the metadata for the directories fits in the system cache, it doesn't matter how you do it.

EDIT: clarified how you should visit directories. It isn't technically a breadth-first search.
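
Here is a sketch of that visiting order (my illustration, not MSN's code; countFilesQueued is a made-up name): finish enumerating each directory completely, pushing subdirectories onto a pending list, and only then open the next one:

#include <windows.h>
#include <cwchar>
#include <string>
#include <vector>

// Count plain files under root, fully enumerating each directory before
// opening the next. Subdirectories are queued as they are found.
int countFilesQueued(const std::wstring &root) {
    std::vector<std::wstring> pending;       // directories still to visit
    pending.push_back(root);
    int count = 0;
    while (!pending.empty()) {
        std::wstring dir = pending.back();   // LIFO order; the point is not the
        pending.pop_back();                  // order but finishing each directory
        WIN32_FIND_DATAW fd;
        HANDLE h = FindFirstFileW((dir + L"\\*").c_str(), &fd);
        if (h == INVALID_HANDLE_VALUE) continue;
        do {
            if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
                continue;
            if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
                pending.push_back(dir + L"\\" + fd.cFileName);  // visit later
            else
                ++count;
        } while (FindNextFileW(h, &fd));
        FindClose(h);                        // directory fully read before descending
    }
    return count;
}

As the EDIT says, a LIFO pending list like this gives a depth-first order overall; what matters is that each directory is read to the end before any child is opened.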

answered by MSN


The drive is probably the bottleneck, but you can try:

  1. String operations can be optimized: use char arrays instead of std::string.
  2. Building strFullName on every recursive call is not necessary. Use a single fixed buffer of chars (e.g. a static array inside the function) and modify it in place (see the sketch after this list).
  3. Don't pass strPattern by value!
  4. Don't create strType until you are debugging.
  5. Others suggested building a list of directories to process before recursing deeper. To build it I suggest a single static array (similarly to 2.) or the stack (alloca).
  6. Does the filesystem use Unicode to store file names? If so, using Unicode strings with FindFirstFileW and FindNextFileW may be a little faster.
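
A minimal sketch combining suggestions 2 and 6 (my code, not adf88's; countIn and g_path are made-up names, and overflow checks are omitted for brevity): one shared fixed path buffer that is extended and truncated in place, plus the wide-char API:

#include <windows.h>
#include <cwchar>

static wchar_t g_path[MAX_PATH];             // single shared path buffer

// Count plain files below the path currently in g_path; len is its length.
// No overflow checks: real code must ensure len + name length + 2 < MAX_PATH.
static int countIn(size_t len) {
    wcscpy(g_path + len, L"\\*");            // search pattern for this directory
    WIN32_FIND_DATAW fd;
    HANDLE h = FindFirstFileW(g_path, &fd);
    if (h == INVALID_HANDLE_VALUE) return 0;
    int count = 0;
    do {
        if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
            continue;
        g_path[len] = L'\\';                 // append the entry name in place...
        wcscpy(g_path + len + 1, fd.cFileName);
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY)
            count += countIn(len + 1 + wcslen(fd.cFileName));
        else
            ++count;
        g_path[len] = L'\0';                 // ...and truncate back afterwards
    } while (FindNextFileW(h, &fd));
    FindClose(h);
    return count;
}

int main() {
    wcscpy(g_path, L"C:\\test");             // hypothetical root directory
    return countIn(wcslen(g_path));
}

No std::string is constructed per entry, and nothing is passed by value on recursion.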
answered by adf88