I have a target board running Linux with more than 5 million files in a single directory (the directory has no subdirectories). If I execute this program, it takes several minutes to get the total space information. Is there a faster way to accomplish this? Thanks.
#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>
#include <errno.h>
void calcSpace(char *path, long long int *totalSpace)
{
    DIR *dir;              /* directory stream we are reading */
    struct dirent *ent;    /* directory entry currently being processed */
    char absPath[PATH_MAX];
    struct stat statbuf;   /* buffer for stat() */
    long long int fileCount = 0;

    fprintf(stderr, "Opening dir %s\n", path);
    dir = opendir(path);
    if (NULL == dir) {
        perror(path);
        return;
    }
    while ((ent = readdir(dir))) {
        fileCount++;
        /* snprintf truncates instead of overflowing absPath */
        snprintf(absPath, sizeof absPath, "%s/%s", path, ent->d_name);
        if (stat(absPath, &statbuf)) {
            perror(absPath);
            closedir(dir); /* do not leak the directory stream on error */
            return;
        }
        *totalSpace = (*totalSpace) + statbuf.st_size;
    }
    fprintf(stderr, "Closing dir %s\n", path);
    printf("fileCount=%lld.\n", fileCount);
    closedir(dir);
}
int main(int argc, char *argv[])
{
    char *dir;
    long long int totalSpace = 0;

    if (argc > 1)
        dir = argv[1];
    else
        dir = ".";
    calcSpace(dir, &totalSpace);
    printf("totalSpace=%lld\n", totalSpace);
    return 0;
}
As stated in the comments, the main cost appears to be the calls to stat and readdir.
Optimizing readdir calls
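You can cut the cost of the readdir calls by using the getdents(2) syscall directly. readdir(3) is a library function that is itself built on getdents, and glibc already fetches more than one entry per syscall, but the size of its internal buffer is not under your control. Calling getdents yourself lets you pass a much larger buffer and read many more directory entries per syscall, which reduces the syscall overhead for a directory with millions of entries.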
A code example can be found in the man page I linked to. The important thing to note is that you should experiment with the buffer size passed to getdents (the count parameter, which is a size in bytes, not a number of entries; in the example it is 1024) to find the sweet spot for your configuration and machine. This step will probably give you the performance improvement you want over readdir.
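As a rough illustration of the batching idea, here is a minimal sketch that counts directory entries by invoking getdents64 through syscall(2). The function name count_entries, the struct name dirent64_rec and the 1 MiB DENTS_BUF_SIZE are assumptions made for this example, not values taken from the man page; the record layout follows the one documented in getdents(2).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* One variable-length record as returned by getdents64 (see getdents(2)).
 * The struct name is hypothetical; the field layout is the documented one. */
struct dirent64_rec {
    uint64_t       d_ino;     /* inode number */
    int64_t        d_off;     /* offset to the next record */
    unsigned short d_reclen;  /* length of this record in bytes */
    unsigned char  d_type;    /* file type (DT_REG, DT_DIR, ...) */
    char           d_name[];  /* null-terminated file name */
};

/* Assumed starting point for the buffer size; tune it for your machine. */
#define DENTS_BUF_SIZE (1024 * 1024)

/* Returns the number of entries in path (including "." and ".." when the
 * filesystem reports them), or -1 on error. */
long long count_entries(const char *path)
{
    int fd = open(path, O_RDONLY | O_DIRECTORY);
    if (fd < 0) {
        perror(path);
        return -1;
    }

    char *buf = malloc(DENTS_BUF_SIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long count = 0;
    for (;;) {
        /* Each call fills buf with as many records as fit. */
        long nread = syscall(SYS_getdents64, fd, buf, DENTS_BUF_SIZE);
        if (nread < 0) {
            perror("getdents64");
            count = -1;
            break;
        }
        if (nread == 0)   /* end of directory */
            break;

        /* Walk the records packed into the buffer. */
        for (long pos = 0; pos < nread; ) {
            struct dirent64_rec *d = (struct dirent64_rec *)(buf + pos);
            count++;
            pos += d->d_reclen;
        }
    }

    free(buf);
    close(fd);
    return count;
}
```

Calling count_entries(".") would return the entry count for the current directory. Whether a larger buffer actually helps depends on the filesystem and kernel, so it is worth timing a few sizes on the target board.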
Optimizing stat calls
It is recommended to use the fstatat(2) function instead of the regular stat you used (both are described in the same man page). This is because the first parameter of fstatat(2) is a dirfd: a file descriptor for the directory that contains the file you are stat-ing.
That means you can open a file descriptor for the directory once (using open(2)) and then pass this dirfd to every fstatat call. This reduces the work done in the kernel, since the full path, including the directory itself, no longer has to be resolved for every stat syscall. It will probably also make your code simpler and a little faster, because path concatenation is no longer needed.