Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

More efficient way to find & tar millions of files

Tags:

find

bash

tar

I've got a job running on my server at the command line prompt for a two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} ;

It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...

find data/ -name filepattern-*2009* -print > filesOfInterest.txt

...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks.. That seems unreasonable. Is there a more efficient to do this? Maybe with a more complicated bash script?

A secondary questions is "why is my current approach so slow?"

like image 297
Stu Thompson Avatar asked Apr 23 '10 08:04

Stu Thompson


People also ask

How do you find the factors of an efficient way?

Best Approach: If you go through number theory, you will find an efficient way to find the number of factors. If we take a number, say in this case 30, then the prime factors of 30 will be 2, 3, 5 with count of each of these being 1 time, so total number of factors of 30 will be (1+1)*(1+1)*(1+1) = 8.

Is binary search the most efficient?

Binary search is faster than linear search except for small arrays. However, the array must be sorted first to be able to apply binary search. There are specialized data structures designed for fast searching, such as hash tables, that can be searched more efficiently than binary search.


2 Answers

One option is to use cpio to generate a tar-format archive:

$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar

cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.

like image 108
Matthew Mott Avatar answered Oct 18 '22 11:10

Matthew Mott


If you already did the second command that created the file list, just use the -T option to tell tar to read the files names from that saved file list. Running 1 tar command vs N tar commands will be a lot better.

like image 22
frankc Avatar answered Oct 18 '22 13:10

frankc