Current Process:
I have a tar.gz file. (Actually, I have about 2000 of them, but that's another story.) I un-tar the tar.gz file, revealing 100,000 tiny files (around 600 bytes each). The temporary space on the machines I'm using can barely handle one of these processes at once, never mind the 16 (hyperthreaded dual quad-core) that they get sent by default.
I'm looking for a way to do this process without saving to disk. I believe the performance penalty for individually pulling files using tar -xf $file -O <targetname> would be prohibitive, but it might be what I'm stuck with.
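For reference, that per-member approach would look roughly like this (a sketch; file.tar.gz and ./process_one_file are placeholders, and each extraction re-reads the whole compressed archive):

file=file.tar.gz
# List every member name, then pull each one out to stdout individually.
tar -tzf "$file" | while read -r name; do
    tar -xzf "$file" -O "$name" | ./process_one_file
done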
Is there any way of doing this?
EDIT: Since two people have already made this mistake, I'm going to clarify: each file needs to be processed individually, so the boundaries between files matter.
EDIT2: Actual code:
for f in posns/*; do
    ~/data_analysis/intermediate_scattering_function < "$f"
done | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt
If you do not care about the boundaries between files, then tar --to-stdout -xf $file will do what you want; it will send the contents of each file in the archive to stdout, one after the other.
This assumes you are using GNU tar, which is reasonably likely if you are using bash.
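For example, a minimal sketch using the paths from the question (this assumes the downstream programs are happy to receive every file's contents concatenated on a single stdin stream):

# Stream every member straight through the pipeline; nothing is written to
# temporary disk space, but the boundaries between files are lost.
tar --to-stdout -xzf "$file" \
    | ~/data_analysis/intermediate_scattering_function \
    | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt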
[Update]
Given the constraint that you do want to process each file separately, I agree with Charles Duffy that a shell script is the wrong tool.
You could try his Python suggestion, or you could try the Archive::Tar Perl module. Either of these would allow you to iterate through the tar file's contents in memory.
This sounds like a case where the right tool for the job is probably not a shell script. Python has a tarfile module which can operate in streaming mode, letting you make only a single pass through the large archive and process its files, while still being able to distinguish the individual files (which the tar --to-stdout approach will not).
You can use the tar option --to-command=cmd to execute a command for each file. Tar redirects the file's content to the standard input of the command, and sets some environment variables with details about the file, such as TAR_FILENAME. More details are in the tar documentation.
e.g.
tar zxf file.tar.gz --to-command='./process.sh'
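A hypothetical process.sh matching the pipeline from the question could look like this (a sketch only; the analysis program paths are taken from the EDIT2 code):

#!/bin/bash
# Run by tar once per archive member via --to-command.
# The member's content arrives on stdin; its name is in $TAR_FILENAME.
echo "processing $TAR_FILENAME" >&2                # log to stderr, not the data stream
~/data_analysis/intermediate_scattering_function   # reads this member from stdin

Since tar does not redirect the command's stdout, each invocation's output is passed through to tar's own stdout, so the second stage of the question's pipeline can be attached to the outer command:

tar zxf file.tar.gz --to-command='./process.sh' | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt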
Note that OSX uses bsdtar by default, which does not have this option. You can explicitly call gnutar instead.
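For example (assuming GNU tar is installed under that name; a Homebrew install of gnu-tar typically provides it as gtar):

gnutar zxf file.tar.gz --to-command='./process.sh'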