
Faster way to get the total space taken by a directory containing 5 million files in Linux

I have a target board running Linux with a directory that contains more than 5 million files. (The directory has no subdirectories.) When I run the program below, it takes several minutes to compute the total space. Is there a faster way to accomplish this? Thanks

#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>
#include <errno.h>


void calcSpace(char *path,  long long int *totalSpace) 
{
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char absPath[PATH_MAX];  /* full path of the current entry */
    struct stat statbuf;     /* buffer for stat()*/
    long long int fileCount=0;

    fprintf(stderr, "Opening dir %s\n", path);
    dir = opendir(path);
    if(NULL == dir) {
        perror(path);
        return;
    }
    while((ent = readdir(dir))) 
    {
       fileCount++;
       snprintf(absPath, sizeof(absPath), "%s/%s", path, ent->d_name);
       if(stat(absPath, &statbuf)) {
          perror(absPath);
          closedir(dir);   /* don't leak the directory stream on error */
          return;
       }
       *totalSpace= (*totalSpace) + statbuf.st_size;
    }

    fprintf(stderr, "Closing dir %s\n", path);
    printf("fileCount=%lld.\n", fileCount);
    closedir(dir);
}

int main(int argc, char *argv[]) 
{
    char *dir;
    long long int totalSpace=0;
    if(argc > 1)
        dir = argv[1];
    else
        dir = ".";

    calcSpace(dir,  &totalSpace);
    printf("totalSpace=%lld\n", totalSpace);
    return 0;
}
m4n07 asked Nov 06 '22 12:11


1 Answer

As stated in the comments, the main costs appear to be the calls to stat and readdir.

Optimizing readdir calls

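We can cut the cost of reading the directory by calling the getdents(2) syscall directly. It is the syscall that readdir is built on: each call fills a caller-supplied buffer with as many directory entries as fit. glibc's readdir already batches entries this way, but with an internal buffer whose size you do not control. Calling getdents yourself lets you choose a much larger buffer, so the 5 million entries are read in far fewer syscalls.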

A code example can be found in the getdents(2) man page. The important thing to note is that you should experiment with the buffer size passed to getdents (the count parameter, given in bytes; the man page example uses 1024) to find the sweet spot for your configuration and machine. This tuning is the step most likely to give you a measurable improvement over readdir.
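As a rough illustration of the batching idea, here is a minimal sketch that walks a directory with getdents64 through syscall(2). The names count_entries, struct dent64 and DENTS_BUF_SIZE are made up for this example, and the 1 MiB buffer is only an assumed starting point to tune, not a recommended value. The record layout follows the one given in the getdents(2) man page.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Record layout returned by getdents64(2); the struct name is our own. */
struct dent64 {
    uint64_t       d_ino;     /* inode number */
    int64_t        d_off;     /* offset to the next record */
    unsigned short d_reclen;  /* length of this record in bytes */
    unsigned char  d_type;    /* file type (DT_REG, DT_DIR, ...) */
    char           d_name[];  /* NUL-terminated file name */
};

/* Assumption: 1 MiB is just a starting value; measure and tune it. */
#define DENTS_BUF_SIZE (1024 * 1024)

/* Counts the entries in `path`, skipping "." and "..".
 * Returns the count, or -1 on error. */
long long count_entries(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0)
        return -1;

    char *buf = malloc(DENTS_BUF_SIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long count = 0;
    long nread;
    /* Each syscall fills buf with as many records as fit. */
    while ((nread = syscall(SYS_getdents64, fd, buf, DENTS_BUF_SIZE)) > 0) {
        for (long pos = 0; pos < nread; ) {
            struct dent64 *d = (struct dent64 *)(buf + pos);
            if (strcmp(d->d_name, ".") != 0 && strcmp(d->d_name, "..") != 0)
                count++;
            pos += d->d_reclen;  /* records are variable-length */
        }
    }

    free(buf);
    close(fd);
    return nread < 0 ? -1 : count;
}
```

Inside the inner loop, d->d_name is the bare file name relative to the directory, so it can be handed to any call that takes a directory file descriptor plus a name. Call the function as count_entries(argv[1]) and time it with different DENTS_BUF_SIZE values on the target board.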

Optimizing stat calls

Use fstatat(2) instead of the plain stat call from your code (both are documented in the same man page). The first parameter of fstatat(2) is a dirfd: a file descriptor for the directory that contains the file you are stat-ing.

That means you can open a file descriptor for the directory once (using open(2)) and pass that dirfd, together with the bare file name, to every fstatat call. The kernel then no longer has to resolve the full path down to the directory for every stat syscall, and your code becomes simpler and probably a little faster, since the path concatenation is no longer needed.
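A minimal sketch of that change, kept close to the loop in the question: it still uses opendir/readdir, but takes the directory's file descriptor with dirfd(3) and passes it to fstatat, so no path is built per file. The function name sum_sizes and its return convention are assumptions made for this example. Unlike the original loop, it skips "." and ".." so that only the files are counted.

```c
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <fcntl.h>      /* AT_* flags used by fstatat */
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Sums st_size of every entry in `path` (except "." and "..").
 * Stores the byte total and the entry count through the pointers.
 * Returns 0 on success, -1 on error. */
int sum_sizes(const char *path, long long *total, long long *count)
{
    DIR *dir = opendir(path);
    if (dir == NULL) {
        perror(path);
        return -1;
    }

    int dfd = dirfd(dir);  /* owned by dir; closedir() releases it */
    struct dirent *ent;
    struct stat st;

    *total = 0;
    *count = 0;
    while ((ent = readdir(dir)) != NULL) {
        if (strcmp(ent->d_name, ".") == 0 || strcmp(ent->d_name, "..") == 0)
            continue;
        /* Name is resolved relative to dfd; flags 0 follows symlinks
         * like the stat() call in the question. */
        if (fstatat(dfd, ent->d_name, &st, 0) != 0) {
            perror(ent->d_name);
            closedir(dir);
            return -1;
        }
        *total += st.st_size;
        (*count)++;
    }

    closedir(dir);
    return 0;
}
```

This drops the absPath buffer and the per-file sprintf entirely. How much time it saves on the target board is something to measure rather than assume.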

or523 answered Nov 13 '22 20:11