I've got a job running on my server at the command-line prompt for two days now:
find data/ -name 'filepattern-*2009*' -exec tar uf 2009.tar {} \;
It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...
find data/ -name 'filepattern-*2009*' -print > filesOfInterest.txt
...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?
A secondary question is "why is my current approach so slow?"
One option is to use cpio to generate a tar-format archive:
$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar
cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.
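Since --format=ustar writes a standard tar archive, the result can be checked or unpacked with tar itself. For example (the destination directory here is just a placeholder):

$ tar tf 2009.tar | head
$ tar xf 2009.tar -C /some/destination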
If you already ran the second command that created the file list, just use the -T
option to tell tar to read the file names from that saved file list. Running 1 tar command instead of N tar commands will be a lot faster, as shown below.
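A minimal sketch, assuming GNU tar (which supports -T/--files-from) and the filesOfInterest.txt list generated above:

$ tar cf 2009.tar -T filesOfInterest.txt

If any file names might contain unusual characters, a safer variant is to have find emit NUL-separated names and pass --null to tar:

$ find data/ -name 'filepattern-*2009*' -print0 | tar -cf 2009.tar --null -T -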