I've got a job running on my server at the command-line prompt for two days now:
find data/ -name 'filepattern-*2009*' -exec tar uf 2009.tar {} \;
It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...
find data/ -name 'filepattern-*2009*' -print > filesOfInterest.txt
...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?
A secondary question is "why is my current approach so slow?"
One option is to use cpio to generate a tar-format archive:
$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar
cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.
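Since --format=ustar writes a standard tar archive, the result can be checked or unpacked with tar itself. For example (the destination directory here is just a placeholder):

$ tar tf 2009.tar | head
$ tar xf 2009.tar -C /some/destination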
If you already ran the second command that created the file list, just use the -T
option to tell tar to read the file names from that saved file list. Running 1 tar command instead of N tar commands will be a lot faster, as shown below.
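A minimal sketch, assuming GNU tar (which supports -T/--files-from) and the filesOfInterest.txt list generated above:

$ tar cf 2009.tar -T filesOfInterest.txt

If any file names might contain unusual characters, a safer variant is to have find emit NUL-separated names and pass --null to tar:

$ find data/ -name 'filepattern-*2009*' -print0 | tar -cf 2009.tar --null -T -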